Is WCF “Straightforward” for Long Running Tasks?

My father sent me a link to this article on SOA scalability. He thought it was pretty good until he got to this paragraph:

Long-running tasks become more complex. You cannot assume that your client can maintain a consistent connection to your web service throughout the life of a task that takes 15 minutes, much less one hour or two days. In this case, you need to implement a solution that follows a full-duplex pattern (where your client is also a service and gets notified when the task is completed) or a polling scheme (where your client checks back later to get the results). Both of these solutions require stateful services. This full-duplex pattern becomes straightforward to implement using the Windows Communications Foundation (Indigo) included with .NET 3.0.

When I first saw duplex channels in WCF, I figured you could use them for long running tasks too. It turns out that of the nine standard WCF bindings, only four support duplex contracts. Of those four, one is designed for peer-to-peer scenarios and one uses named pipes, which don’t work across the network, so neither is usable in the article’s scenario. NetTcp can only provide duplex contracts within the scope of a consistent connection, which the author has already ruled out as a solution. That leaves wsDualHttp, which is implemented much as the author describes: both the client and the service listen on the network for messages. There’s even a standard binding element – Composite Duplex – which ties two one-way messaging channels into a duplex channel.
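
For concreteness, here’s roughly what a duplex contract looks like – a minimal sketch with hypothetical names (ILongRunningTask, ITaskCallback), not code from the article:

```csharp
using System;
using System.ServiceModel;

// Callback contract: the operations the service invokes on the client.
public interface ITaskCallback
{
    [OperationContract(IsOneWay = true)]
    void TaskCompleted(Guid taskId, string result);
}

// Service contract; the CallbackContract property is what makes it duplex.
[ServiceContract(CallbackContract = typeof(ITaskCallback))]
public interface ILongRunningTask
{
    [OperationContract(IsOneWay = true)]
    void StartTask(Guid taskId);
}
```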

Alas, the wsDualHttp solution has a few flaws that render it – in my opinion at least – unusable for exactly these sorts of long running scenarios. On the client side, while you can specify the ClientBaseAddress, you can’t specify the entire ListenUri. Instead, wsDualHttp generates a random GUID and tacks it onto the end of your ClientBaseAddress, effectively creating a random URL every time you run the client app. So if you shut down and restart your client app, you’re now listening on a different URL than the one the service is going to send messages to, and the connection is broken. Oops.
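
Here’s a sketch of the client-side setup that runs into this, continuing the hypothetical contract above (the addresses and TaskCallbackHandler – a class implementing ITaskCallback – are made up for illustration):

```csharp
// You can pin down the base address...
var binding = new WSDualHttpBinding();
binding.ClientBaseAddress = new Uri("http://client-box:8000/callback");

var factory = new DuplexChannelFactory<ILongRunningTask>(
    new InstanceContext(new TaskCallbackHandler()),  // hypothetical callback impl
    binding,
    new EndpointAddress("http://server/tasks"));
ILongRunningTask proxy = factory.CreateChannel();

// ...but the actual listen address ends up something like
//   http://client-box:8000/callback/2f4c8a90-...
// WCF appends a fresh GUID every run, and the standard binding gives
// you no way to pin down the full ListenUri.
```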

The issues don’t end there. On the service side of a duplex contract, you get an object you can use to call back to the client via OperationContext.Current.GetCallbackChannel. This works fine, as long as you don’t have to shut down your service. There’s no way to persist the callback channel information to disk and later recreate it. So if you shut down and restart your service, there’s no way to reconnect with the client, even if they haven’t changed the URL they’re listening on. Oops.
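
The service side of the sketch looks something like this – note that the callback channel is purely an in-memory object:

```csharp
public class TaskService : ILongRunningTask
{
    public void StartTask(Guid taskId)
    {
        // The callback channel only exists in memory, tied to the current
        // session; there's no supported way to serialize it to disk and
        // recreate it after a service restart.
        ITaskCallback callback =
            OperationContext.Current.GetCallbackChannel<ITaskCallback>();

        // ...kick off the long running work, then hours or days later:
        callback.TaskCompleted(taskId, "done");
    }
}
```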

So in other words, WCF can do long running services using the wsDualHttp binding, as long as you don’t restart the client or service during the conversation. Because that would never ever happen, right?

This is part of the reason why I’m sold on Service Broker. From where I sit, it looks like WCF can’t handle long running operations at all – at least, not with any of the built-in transports and bindings. You may be able to build something custom that would work for long running services; I’m not a deep enough expert on WCF to know. From reading what Nicholas Allen has to say about CompositeDuplex, I’m fairly sure you could work around the client URL issue if you built a custom binding element to set the ListenUriBaseAddress. But I have no idea how to deal with the service callback channel issue. It doesn’t appear that the necessary plumbing is there at all to persist and rehydrate the callback channel. If you can’t do that, I don’t see how you can reliably support long running services.

The Other Foundation Technology

I mentioned last week that WF “is one of two foundation technologies that my project absolutely depends on”. Sam Gentile assumes the other foundation technology is WCF. It’s not.

As a quick reminder, my day job these days is to architect and deliver shared service-oriented infrastructure for Microsoft’s IT division. These services will be automating long running business operations. And when I say long running, I mean days, weeks or longer. While there will surely be some atomic or stateless services, I expect most of the services we build will be long running. Thus, the infrastructure I’m responsible for has to enable and support long running services.

The other foundation technology my project depends on is Service Broker. Service Broker was expressly designed for building these types of long running services. It supports several capabilities that I consider absolutely critical for long running services:

  • Service Initiated Interaction. Polling for changes is inefficient. Long running operations need support for the Solicit-Response and/or Notification message exchange patterns.
  • Durable Messaging. The first fallacy of distributed computing is that the network is reliable. If you need to be 100% sure the message gets delivered, you have to write it to disk on both sides (see the send sketch after this list).
  • Service Instance Dehydration. It’s both dangerous and inefficient to keep an instance of a long running service in memory when it’s idle. In order to maximize integrity (i.e. service instances survive a system crash) as well as resource utilization (i.e. we’re not wasting memory/CPU/etc on idle service instances), service instances must be dehydrated to disk.
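
To make the durable messaging point concrete, here’s a minimal sketch of a transactional send via Service Broker from ADO.NET. The connection string, service, contract, and message type names are all hypothetical, and error handling is omitted:

```csharp
using System;
using System.Data.SqlClient;

using (var conn = new SqlConnection(connectionString))
{
    conn.Open();
    using (var tx = conn.BeginTransaction())
    {
        var cmd = conn.CreateCommand();
        cmd.Transaction = tx;
        cmd.CommandText = @"
            DECLARE @dialog UNIQUEIDENTIFIER;
            BEGIN DIALOG CONVERSATION @dialog
                FROM SERVICE [//Orders/Initiator]
                TO SERVICE   '//Orders/Target'
                ON CONTRACT  [//Orders/Contract];
            SEND ON CONVERSATION @dialog
                MESSAGE TYPE [//Orders/SubmitOrder] (@body);";
        cmd.Parameters.AddWithValue("@body", "<SubmitOrder>...</SubmitOrder>");
        cmd.ExecuteNonQuery();

        // The message is written to disk and commits (or rolls back)
        // together with the rest of your data changes.
        tx.Commit();
    }
}
```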

In addition to these capabilities, Service Broker supports something called Conversation Group Locking, which turns out to be important when building highly scalable long running services. Furthermore, my understanding is that Conversation Group Locking is a feature unique to Service Broker, not only across Microsoft’s products but across the industry. Basically, it means that inbound messages for a specific long running service instance are locked so they can’t be processed on more than one thread at a time.

Here’s an example: let’s say I’m processing a Cancel Order message for a specific order when the Ready to Ship message for that order arrives. With Conversation Group Locking, the Ready to Ship message stays locked in the queue until the Cancel Order message transaction is complete, regardless of how many service threads there are. Without Conversation Group Locking, the Ready to Ship message might get processed by another service thread at the same time the Cancel Order message is being processed. The customer might get notified that the cancellation succeeded while the shipping service gets notified to ship the product. Oops.
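
In code, the locking falls out of doing the receive inside a transaction. A minimal sketch (the queue name is hypothetical):

```csharp
using (var conn = new SqlConnection(connectionString))
{
    conn.Open();
    using (var tx = conn.BeginTransaction())
    {
        var cmd = conn.CreateCommand();
        cmd.Transaction = tx;
        cmd.CommandText = @"
            WAITFOR (
                RECEIVE TOP (1)
                    conversation_group_id,
                    message_type_name,
                    message_body
                FROM [OrderQueue]
            ), TIMEOUT 5000;";
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                // Process the Cancel Order message here. While this
                // transaction is open, the whole conversation group is
                // locked: another thread running this same loop gets
                // messages from a *different* group (or none at all),
                // never the Ready to Ship message for this order.
            }
        }
        tx.Commit(); // only now can Ready to Ship be picked up
    }
}
```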

There’s also an almost-natural fit between Service Broker and Windows Workflow. For example, a Service Broker Conversation Group and a WorkflowInstance are roughly analogous. They even both use a Guid for identification, making the mapping between Conversation Group and WF Instance simple and direct. I was able to get a prototype Service Broker / WF integration up and running in about a day. I’ll post more on that integration later this week.
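
Here’s the shape of that mapping, as a sketch against the WF 3.0 runtime (workflowRuntime is assumed to be a configured WorkflowRuntime with a persistence service; OrderWorkflow is hypothetical):

```csharp
using System;
using System.Workflow.Runtime;

// First message on a new conversation group: create the workflow
// instance using the conversation group GUID as its instance id.
WorkflowInstance instance = workflowRuntime.CreateWorkflow(
    typeof(OrderWorkflow), null, conversationGroupId);
instance.Start();

// Subsequent messages on the same group: the GUID from RECEIVE *is*
// the instance id, so reloading the (possibly dehydrated) workflow
// is a direct lookup.
WorkflowInstance existing = workflowRuntime.GetWorkflow(conversationGroupId);
```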

Last but not least, Service Broker is wicked fast. Unfortunately, I don’t have any public benchmarks to point to, but the Service Broker team told me about a private customer benchmark that handled almost 9,000 messages per second! One of the reasons Service Broker is so fast is that it’s integrated into SQL Server 2005, which is pretty fast in its own right. Since Service Broker is baked right in, you can do all your messaging work and your data manipulation within the scope of a local transaction.

Service Broker has a few rough areas, and it lacks a supported managed API (though there is a sample managed API available). Probably the biggest issue is that Service Broker has almost no interop story. If you need to interop with a Service Broker service, you can use SQL Server’s native Web Service support or the BizTalk adapter for Service Broker from AdapterWORX. However, I’m not sure how many of Service Broker’s native capabilities are exposed if you use these interop mechanisms. You would probably have to write a bunch of application code to make these capabilities work in an interop scenario.

Still, I feel Service Broker’s unique set of capabilities, its natural fit with WF and its high performance make it the best choice for building my project’s long running services. Is it the best choice for your project? I have no idea. One of the benefits of working for MSIT is that I get to focus on solving a specific problem and not on solving general problems. I would say that if you’re doing exclusively atomic or stateless services, Service Broker is probably overkill. If you’re doing any long running services at all, I would at least give Service Broker a serious look.

Lip Service on Long Term Planning

Long-term readers know my liberal political leanings. So it should come as no surprise to them that I read liberal blogs like Washington Monthly. But this isn’t a post about politics, it’s a post about planning:

This kind of long-term planning — in politics, in business, in nearly every walk of life — is something that nearly everyone says they support, but when push comes to shove very few people are willing to back it up. There’s always something this week, or this month, or this year that seems uniquely crucial and demands our attention. Next year there will be something else, and the year after that something else again. The long-term stuff simply never gets done unless someone like Dean is willing to go to the mat for it.
[Building a Better Movement, Kevin Drum]

I don’t have much to add to this, except that planning is a big part of architecture, especially architecture in the enterprise (which may or may not be “Enterprise Architecture”). Who “goes to the mat” for the long-term stuff at your company? Or does the long-term stuff simply never get done?

Thoughts on the SOA Workshop

Last week, I attended an SOA workshop presented by SOA Systems and delivered by “top-selling SOA author” Thomas Erl. It was two SOA-jammed days, plus the drive to Vancouver and back spent primarily discussing SOA with Dale. In other words, it was a lot of SOA. I went up expecting to take Erl to task for his “Services are Stateless” principle. That turned out to be a misunderstanding on my part about how Erl uses the term stateless. But while Erl and I agreed on optimizing memory utilization (which is what he means by stateless), there wasn’t much else in the way of common ground. As I wrote last week, Erl’s vision of service-orientation is predicated on unrealistic organizational behavior and offers, at best, unproven promises of cost and time savings in the long run via black box reuse.

Erl spends a lot of time talking about service reuse. I think it’s safe to say, in Erl’s mind, reuse is the primary value of service orientation. However, he didn’t offer any reason to believe we can reuse services any more successfully than we were able to reuse objects. Furthermore, his predictions about the amount of reuse you can achieve are completely made up. At one point, he was giving actual reuse numbers (i.e. 35% new code, 65% existing code). When I asked him where those numbers came from, Erl admitted that they were “estimates” because “there hasn’t been enough activity in serious SOA projects to provide accurate metrics” and that there is “no short term way of proving” the amount of service reuse. In other words, Erl made those numbers up out of thin air.

This whole “serious” or “real” SOA is a major theme with Erl. On the one hand, I agree that SOA is a horribly overused term. Many projects labeled SOA have little or nothing to do with SO. On the other hand, it seems pretty convenient to chalk up failed projects as not being “real” SOA so you can continue to spout attractive yet completely fictional reuse numbers. I asked about Gartner’s 20% service reuse prediction, and Erl responded that the low reuse number was because the WS-* specs are still in process. While I agree that the WS-* specs are critical to the advancement of SO, I fail to see how the lack of security, reliable messaging and transactions is holding back reuse. If anything, I would expect those specs to impede reuse, as they add further contextual requirements to the service.

While I think Erl is mistaken when it comes to the potential for service reuse, he’s absolutely dreaming when it comes to the organizational structure and behavior that have to be in place for this potential service reuse to happen in the first place. I’m not sure what Erl was doing before he became a “top-selling SOA author,” but I find it hard to believe it included any time in any significantly sized IT shop.

Erl expects services – “real” services, anyway – to take around 30% more time and money than the traditional siloed approach. The upside for spending this extra time and money is the potential service reuse. The obvious problem with this is that we don’t know how much reuse we’re going to see for this extra time and money. If you spend 30% more but can only reuse 20% of your services (as Gartner predicts), is it worth it? If you end up spending 50% more but are only able to reuse 10% of your services, is it worth it? Where’s the line beyond which it’s no longer worth it to do SOA? Given that there’s no real way to know how much reuse you’re going to see, Erl’s vision of SOA requires a huge leap of faith on the part of the implementer. “Huge leap of faith” doesn’t go so well with “corporate IT department”.

Furthermore, the next IT project I encounter that is willing to invest any additional time and money – much less 30% – in order to achieve some theoretical organizational benefit down the road will be the first. Most projects I’ve encountered (including inside MSIT) sacrifice long term time and money in return for short term gain. When asked how to make this 30% investment happen, Erl suggested that the CIO has to have a “dictatorial” relationship with the projects in the IT shop. I’m thinking that CIOs who adopt a dictatorial stance won’t get much cooperation from the IT department and will soon be ex-CIOs.

In the end, I got a lot less out of this workshop than I was hoping to. As long as SO takes 30% more time and money and the primary benefit is the same retread promises of reuse that OO failed to deliver on, I have a hard time imagining SO making much headway.

PS – I have a barely used copy of “Service-Oriented Architecture: Concepts, Technology, and Design” if anyone wants to trade for it. It’s not a red paperclip, but it’s like new – only flipped through once. 😄

Stateless != Stateless

A while back, I blogged that Services Aren’t Stateless, in response to some stuff in Thomas Erl’s latest book. At the time, I mentioned that I was looking forward to discussing my opinions with Erl when I attended his workshop. I’ve spent the last two days at said workshop. I’ll have a full write-up on the workshop later this week, but I wanted to blog the resolution to this stateless issue right away.

At the time, I wrote “I assume Erl means that service should be stateless in the same way HTTP is stateless.” Turns out, my assumption was way, way wrong. When he addressed this topic in his workshop, he started by talking about dealing with concurrency and scalability, which got me confused at first. Apparently, when Erl says stateless, he’s referring to minimizing memory usage. That is, don’t keep service state in memory longer than you need to. So all the stuff about activity data is fine as per Erl’s principles, as long as you write it out to a database instead of keeping it in memory. In his book, he talks about the service being “temporarily stateful” while processing a message. When I read that, I didn’t get it – because I was thinking of the HTTP definition of stateless & stateful. But if we’re just talking about raw memory usage, it suddenly makes sense.
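
In code terms, I now read Erl’s principle as something like this sketch (all names hypothetical):

```csharp
// "Stateless" in Erl's sense: state lives in the database and is only
// held in memory while a message is actually being processed.
public void ProcessMessage(Guid orderId, OrderMessage message)
{
    OrderState state = LoadState(orderId);   // hydrate from the database
    state.Apply(message);                    // "temporarily stateful"
    SaveState(orderId, state);               // dehydrate back to disk
    // nothing about this order stays in memory between messages
}
```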

On the one hand, I’m happy to agree 100% with Erl on another of his principles. Frankly, much of what he talked about in his workshop seems predicated on unrealistic organizational behavior and offers, at best, unproven promises of cost and time savings in the long run via black box reuse. So to be in complete agreement with him on something was a nice change of pace. Thomas is much more interested in long-running and async services than I originally expected when I first flipped thru his book.

On the other hand, calling this out as a “principle of service orientation” hardly seems warranted. I mean, large scale web sites have been doing this for a long time, and SQL Session State support has been a part of ASP.NET since v1. Furthermore, using the term “stateless” in this way is fundamentally different from the way HTTP and the industry at large use it, which was the source of my confusion. So while I agree with the concept, I really wish Erl hadn’t chosen an overloaded term to refer to it.