Is WCF “Straightforward” for Long Running Tasks?

My father sent me a link to this article on SOA scalability. He thought it was pretty good until he got to this paragraph:

Long-running tasks become more complex. You cannot assume that your client can maintain a consistent connection to your web service throughout the life of a task that takes 15 minutes, much less one hour or two days. In this case, you need to implement a solution that follows a full-duplex pattern (where your client is also a service and gets notified when the task is completed) or a polling scheme (where your client checks back later to get the results). Both of these solutions require stateful services. This full-duplex pattern becomes straightforward to implement using the Windows Communications Foundation (Indigo) included with .NET 3.0.

When I first saw duplex channels in WCF, I figured you could use them for long running tasks too. Turns out that of the nine standard WCF bindings, only four support duplex contracts. Of those four, one is designed for peer-to-peer scenarios and one uses named pipes, so it doesn’t work across the network – neither is usable in the article’s scenario. NetTcp can only provide duplex contracts within the scope of a consistent connection, which the author has already ruled out as a solution. That leaves wsDualHttp, which is implemented much as the author describes: both the client and the service listen on the network for messages. There’s even a standard binding element – Composite Duplex – that ties two one-way messaging channels into a duplex channel.

Alas, the wsDualHttp solution has a few flaws that render it – in my opinion at least – unusable for exactly these sorts of long running scenarios. On the client side, while you can specify the ClientBaseAddress, you can’t specify the entire ListenUri. Instead, wsDualHttp generates a random guid and tacks it on the end of your ClientBaseAddress, effectively creating a random url every time you run the client app. So if you shut down and restart your client app, you’re now listening on a different url than the one the service is going to send messages to and the connection is broken. Oops.
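
To see what I mean, here’s a minimal sketch of the knob you do get – the address is a placeholder, and the GUID behavior is what you observe at runtime, not something you opt into:

    using System;
    using System.ServiceModel;

    class WsDualHttpSketch
    {
        static void Main()
        {
            // You can pin the base address the client listens on...
            WSDualHttpBinding binding = new WSDualHttpBinding();
            binding.ClientBaseAddress = new Uri("http://localhost:8000/client/");

            // ...but the actual ListenUri ends up as something like
            // http://localhost:8000/client/<random-guid>/ -- different on
            // every run, so a restarted client is listening at an address
            // the service doesn't know about.
        }
    }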

The issues don’t end there. On the service side of a duplex contract, you get an object you can use to call back to the client via OperationContext.Current.GetCallbackChannel. This works fine, as long as you don’t have to shut down your service. There’s no way to persist the callback channel information to disk and later recreate it. So if you shut down and restart your service, there’s no way to reconnect with the client, even if they haven’t changed the url they’re listening on. Oops.
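
For anyone who hasn’t seen a duplex contract, here’s roughly the shape of one – the contracts are hypothetical, but GetCallbackChannel is the real call:

    using System.ServiceModel;

    [ServiceContract(CallbackContract = typeof(ITaskCallback))]
    public interface ILongRunningTask
    {
        [OperationContract(IsOneWay = true)]
        void StartTask(string taskData);
    }

    public interface ITaskCallback
    {
        [OperationContract(IsOneWay = true)]
        void TaskCompleted(string result);
    }

    public class TaskService : ILongRunningTask
    {
        public void StartTask(string taskData)
        {
            // The callback channel is an in-memory proxy tied to the current
            // connection. There's no supported way to serialize it to disk
            // and recreate it after a service restart.
            ITaskCallback callback =
                OperationContext.Current.GetCallbackChannel<ITaskCallback>();
            callback.TaskCompleted("done");
        }
    }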

So in other words, WCF can do long running services using the wsDualHttp binding, as long as you don’t restart the client or service during the conversation. Because that would never ever happen, right?

This is part of the reason why I’m sold on Service Broker. From where I sit, it looks like WCF can’t handle long running operations at all – at least, not with any of the built-in transports and bindings. You may be able to build something custom that would work for long running services; I’m not a deep enough expert on WCF to know. From reading what Nicholas Allen has to say about CompositeDuplex, I’m fairly sure you could work around the client url issue if you built a custom binding element to set the ListenUriBaseAddress. But I have no idea how to deal with the service callback channel issue. It doesn’t appear that the necessary plumbing is there at all to persist and rehydrate the callback channel. If you can’t do that, I don’t see how you can reliably support long running services.

Custom Authentication with WCF is Top Shelf

I’ve spent the last three days heads down in WCF security and color me massively impressed. I just checked in a prototype that provides customized authentication for a business service. The idea that you could bang up a custom authentication service fairly easily blows my mind.

The cornerstone to this support in WCF is the standard WSFederationHttpBinding. While the binding name implies support for WS-Federation, which in turn implies infrastructure like Active Directory Federation Services, the binding also scales down to simple federation scenarios with a single Security Token Service (aka STS) as defined by WS-Trust. WS-Trust is similar in spirit to Kerberos: if you want to access a service using the federation binding, you first obtain a security token from the associated STS. Tokens contain SAML assertions, which can be standard – such as Name and Windows SID – or entirely custom, which opens up very interesting and flexible security scenarios.
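
Here’s a rough sketch of pointing the federation binding at a single STS from the client side – the addresses and contract are made up for illustration:

    using System;
    using System.ServiceModel;

    class FederationClientSketch
    {
        [ServiceContract]
        interface IBusinessService
        {
            [OperationContract]
            string DoWork(string input);
        }

        static void Main()
        {
            // Tell the binding where the STS lives and how to talk to it.
            WSFederationHttpBinding binding =
                new WSFederationHttpBinding(WSFederationHttpSecurityMode.Message);
            binding.Security.Message.IssuerAddress =
                new EndpointAddress("http://localhost:8001/sts");
            binding.Security.Message.IssuerBinding = new WSHttpBinding();

            // When a channel opens, WCF sends an RST to the STS behind the
            // scenes, gets back an RSTR containing the issued (SAML) token,
            // and presents that token to the target service.
            ChannelFactory<IBusinessService> factory =
                new ChannelFactory<IBusinessService>(
                    binding, new EndpointAddress("http://localhost:8002/service"));
            IBusinessService proxy = factory.CreateChannel();
        }
    }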

If you want to support multiple authentication systems (Windows, certificates, CardSpace, Passport/Windows Live ID, etc.), an STS is perfect because you can centralize the multiple authentication schemes at the STS, which then hands out a standard token the business service understands. Adding a new auth scheme can happen centrally at the STS rather than in each and every service. Support for multiple authentication schemes was the focus of our current prototype and it worked extremely well.

WCF includes a federation sample, which is where you should start if you’re interested in this stuff. That scenario includes a chain of two STS’s: accessing the secure bookstore service requires authenticating against the bookstore STS, which in turn requires authenticating against a generic “HomeRealm” STS. Since there are two STS’s, the sample factors the common STS code into a shared assembly. You can use that common code to build an STS of your own.

For our prototype, we made only minor changes to the common STS code from the sample. In fact, the only significant change we made was to support programmatic selection of the proof key encryption token. In the sample, both the issuer token and the proof key encryption token are hard coded (passed into the base class constructor). The issuer token is used to sign the custom security token so the target service knows it came from the STS. The encryption token is used to – you guessed it – encrypt the token so it can only be used by the target service. Hard-coding the encryption token means you can only use your STS with a single target service. We changed that so the encryption token can be chosen based on the incoming service token request.
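
The actual sample code is structured differently, but the shape of our change was something like this hypothetical sketch – pick the encryption certificate based on who the issued token is for:

    using System.Collections.Generic;
    using System.IdentityModel.Tokens;
    using System.Security.Cryptography.X509Certificates;

    // Hypothetical helper: instead of one hard-coded encryption token,
    // keep a certificate per target service and pick at request time.
    class ScopedEncryptionTokenPicker
    {
        readonly Dictionary<string, X509Certificate2> serviceCerts =
            new Dictionary<string, X509Certificate2>();

        public SecurityToken GetEncryptionToken(string appliesToUri)
        {
            // The incoming RST's AppliesTo element names the target service.
            // Use it to choose the certificate whose public key encrypts the
            // issued token, so only that service can read it.
            return new X509SecurityToken(serviceCerts[appliesToUri]);
        }
    }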

Of course, it wasn’t all puppy dogs and ice cream. While I like the config system of WCF, anyone who calls it “easy” is full of it. I’ve spent most of the last three days looking at config files. The funny thing about config files is that they’re hard to debug, so most of my effort over the last few days has been a cycle of run app / app throws exception / tweak config / repeat. Ugh.

Also, while the federation sample is comprehensive, I wonder why this functionality isn’t in the base WCF platform. For example, the sample includes implementations of RequestSecurityToken and RequestSecurityTokenResponse, the input and output messages of the STS. But WCF has to have its own implementations of RST and RSTR as well, since it has to send the RST to the STS and process the RSTR it gets in response. A little spelunking revealed the presence of an official WCF implementation of RST and RSTR, both marked internal. I normally fall on the pragmatic side of the internal/public debate, but this one makes little sense to me.

Otherwise, the prototype went smooth as silk and my project teammates were very impressed at how quickly this came together. Several of the project teams we’re working with have identified multiple authentication as the “killer” capability they’re looking to us to provide, so it’s good to know we’re making progress in the right direction.

FeedFlare Finally Fixed

I moved over to FeedBurner a while back. DasBlog has great support for FeedBurner – all you do is set your FeedBurner feed name in the DasBlog config and it handles the rest, including permanently redirecting your readers to the new feed.

However, until today I hadn’t been able to make FeedFlares work. FeedFlares “build interactivity into each post” with little links like “Digg this”, “Email this” or “Add to del.icio.us”. Since FeedBurner is serving the XML feed, it’s no big deal for them to add those links into the RSS feed. But to get those same flares to work on the web site, you have to embed a little script at the end of each item. Scott shows how to do this with DasBlog, except that it didn’t work for me. I’ve tried off and on, but for some reason, the FeedBurner script file I was including was always empty.

Then I noticed the other day that my post WorkflowQueueNames had the flares on them. Hmm, why would that post work when none of the rest of mine did? Turns out it works because there are no spaces in the title. Unlike most of the rest of the DasBlog community, I’m using ‘+’ for spaces in my permalinks, instead of removing them. So I get http://devhawk.net/FeedFlare+Finally+Fixed.aspx as the permalink url instead of http://devhawk.net/FeedFlareFinallyFixed.aspx. In fact, that feature is in DasBlog because I pushed for it (a fact Scott reminded me of while I was troubleshooting this last night). And it was breaking the FeedFlares.

The solution is to URL encode the ‘+’ as %2B in the FeedFlare script link. I created a custom macro, since I already had a few custom macros powering this site anyway, and now I get FeedFlares on all my blog entries. I’ll also go update the DasBlog source, but creating a custom macro was both easier and less risky than patching the tree and upgrading everything.
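
The guts of the macro boil down to something like this – the method name is mine and the script url shape is from memory, so treat it as a sketch:

    // Emit the FeedFlare script tag for a post, encoding the '+' characters
    // in the permalink so FeedBurner can find the item.
    public static string FeedFlareScript(string feedName, string permalink)
    {
        string encoded = permalink.Replace("+", "%2B");
        return string.Format(
            "<script src=\"http://feeds.feedburner.com/~f/{0}?i={1}\" " +
            "type=\"text/javascript\"></script>",
            feedName, encoded);
    }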

The Other Foundation Technology

I mentioned last week that WF “is one of two foundation technologies that my project absolutely depends on”. Sam Gentile assumes the other foundation technology is WCF. It’s not.

As a quick reminder, my day job these days is to architect and deliver shared service-oriented infrastructure for Microsoft’s IT division. These services will be automating long running business operations. And when I say long running, I mean days, weeks or longer. While there will surely be some atomic or stateless services, I expect most of the services we build will be long running. Thus, the infrastructure I’m responsible for has to enable and support long running services.

The other foundation technology my project depends on is Service Broker. Service Broker was expressly designed for building these types of long running services. It supports several capabilities that I consider absolutely critical for long running services:

  • Service Initiated Interaction. Polling for changes is inefficient. Long running operations need support for the Solicit-Response and/or Notification message exchange patterns.
  • Durable Messaging. The first fallacy of distributed computing is that the network is reliable. If you need to be 100% sure the message gets delivered, you have to write it to disk on both sides. (There’s a sketch of what this looks like just after this list.)
  • Service Instance Dehydration. It’s both dangerous and inefficient to keep an instance of a long running service in memory when it’s idle. In order to maximize integrity (i.e. service instances survive a system crash) as well as resource utilization (i.e. we’re not wasting memory/CPU/etc on idle service instances), service instances must be dehydrated to disk.
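
To ground the durable messaging point, here’s a sketch of a transactional send from .NET – the service, contract, and message type names are placeholders:

    using System.Data.SqlClient;

    class DurableSendSketch
    {
        static void SendOrder(SqlConnection conn, string orderXml)
        {
            using (SqlTransaction tx = conn.BeginTransaction())
            {
                SqlCommand cmd = conn.CreateCommand();
                cmd.Transaction = tx;
                cmd.CommandText = @"
                    DECLARE @handle UNIQUEIDENTIFIER;
                    BEGIN DIALOG CONVERSATION @handle
                        FROM SERVICE [//Example/OrderClient]
                        TO SERVICE '//Example/OrderService'
                        ON CONTRACT [//Example/OrderContract];
                    SEND ON CONVERSATION @handle
                        MESSAGE TYPE [//Example/SubmitOrder] (@msg);";
                cmd.Parameters.AddWithValue("@msg", orderXml);
                cmd.ExecuteNonQuery();

                // The outbound message hits disk in the same local
                // transaction as any other data changes -- commit makes
                // both durable together.
                tx.Commit();
            }
        }
    }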

In addition to these capabilities, Service Broker supports something called Conversation Group Locking, which turns out to be important when building highly scalable long running services. Furthermore, my understanding is that Conversation Group Locking is a feature unique to Service Broker, not only across Microsoft’s products but across the industry. Basically, it means that inbound messages for a specific long running service instance are locked so they can’t be processed on more than one thread at a time.

Here’s an example: let’s say I’m processing a Cancel Order message for a specific order when the Ready to Ship message for that order arrives. With Conversation Group Locking, the Ready to Ship message stays locked in the queue until the Cancel Order message transaction is complete, regardless of the number of service threads. Without Conversation Group Locking, the Ready to Ship message might get processed by another service thread at the same time the Cancel Order message is being processed. The customer might get notified that the cancellation succeeded while the shipping service gets notified to ship the product. Oops.
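
Here’s roughly what the receive side looks like – the queue name is a placeholder. RECEIVE takes the conversation group lock inside the transaction, so a second thread running the same code gets messages from a different group:

    using System.Data.SqlClient;

    class QueueReaderSketch
    {
        static void ProcessNext(SqlConnection conn)
        {
            using (SqlTransaction tx = conn.BeginTransaction())
            {
                SqlCommand cmd = conn.CreateCommand();
                cmd.Transaction = tx;
                cmd.CommandText = @"
                    WAITFOR (
                        RECEIVE TOP (1)
                            conversation_group_id,
                            message_type_name,
                            message_body
                        FROM OrderQueue
                    ), TIMEOUT 5000;";

                using (SqlDataReader reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        // ...dispatch on message_type_name here...
                    }
                }

                // Committing completes the message processing and releases
                // the conversation group lock.
                tx.Commit();
            }
        }
    }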

There’s also an almost-natural fit between Service Broker and Windows Workflow. For example, a Service Broker Conversation Group and a WorkflowInstance are roughly analogous. They even both use a Guid for identification, making mapping between Conversation Group and WF Instance simple and direct. I was able to get prototype Service Broker / WF integration up and running in about a day. I’ll post more on that integration later this week.
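
The mapping can be as simple as this sketch – the wrapper class is hypothetical, but GetWorkflow and CreateWorkflow are the real WF runtime calls:

    using System;
    using System.Workflow.Runtime;

    class BrokerWorkflowBridge
    {
        readonly WorkflowRuntime runtime = new WorkflowRuntime();

        // Use the Service Broker conversation group id directly as the WF
        // instance id -- both are Guids, so no lookup table is needed.
        public WorkflowInstance GetOrCreateInstance(
            Guid conversationGroupId, Type workflowType)
        {
            try
            {
                // Loads (rehydrates) the instance if it's been persisted.
                return runtime.GetWorkflow(conversationGroupId);
            }
            catch (InvalidOperationException)
            {
                // No instance with that id yet -- create one, keyed by the
                // conversation group id.
                return runtime.CreateWorkflow(workflowType, null, conversationGroupId);
            }
        }
    }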

Last but not least, Service Broker is wicked fast. Unfortunately, I don’t have any public benchmarks to point to, but the Service Broker team told me about a private customer benchmark that handled almost 9,000 messages per second! One of the reasons Service Broker is so fast is that it’s integrated into SQL Server 2005, which is pretty fast in its own right. Since Service Broker is baked right in, you can do all your messaging work and your data manipulation within the scope of a local transaction.

Service Broker has a few rough areas and it lacks a supported managed API (though there is a sample managed API available). Probably the biggest issue is that Service Broker has almost no interop story. If you need to interop with a Service Broker service, you can use SQL Server’s native Web Service support or the BizTalk adapter for Service Broker from AdapterWORX. However, I’m not sure how many of Service Broker’s native capabilities are exposed through these interop mechanisms. You would probably have to write a bunch of application code to make these capabilities work in an interop scenario.

Still, I feel Service Broker’s unique set of capabilities, its natural fit with WF and its high performance make it the best choice for building my project’s long running services. Is it the best choice for your project? I have no idea. One of the benefits of working for MSIT is that I get to focus on solving a specific problem and not on solving general problems. I would say that if you’re doing exclusively atomic or stateless services, Service Broker is probably overkill. If you’re doing any long running services at all, I would at least give Service Broker a serious look.

QOTD – Rick Barnes

As usual, I’m behind on blogging. This quote is actually from last Tuesday.

“Sunshine is a terrific bleach”
Rick Barnes

Rick, by the way, is my manager.