Transactions in Workflow Foundation-land

I’ve been spending some quality time with SSB and WF of late. On balance, my opinion of both technologies is very positive, though each has some warts of note. For Service Broker, they got the transactional messaging semantics right, but much of the lower-level connection management (what SSB calls “routes”) is clumsy to deal with. For Workflow Foundation, the execution model is amazingly flexible. Unfortunately, WF’s support for transactions is significantly more rigid.

If you’re building an SSB app, your typical execution thread looks like this (sketched in code after the list):

  1. Start a transaction.
  2. Receive message(s) from top of the queue.
  3. Execute service business logic. Obviously, this varies from service to service but it typically involves reading and writing data in the database as well as sending messages to other services.
  4. Commit the transaction.
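In code, that loop looks something like the sketch below. It’s a minimal illustration, not production code: the connection string, the dbo.OrderQueue queue and the ProcessMessage helper are hypothetical stand-ins for your own names and logic.

    using System.Data.SqlClient;

    using (SqlConnection conn = new SqlConnection(connectionString))
    {
        conn.Open();
        using (SqlTransaction tx = conn.BeginTransaction())     // 1. start a transaction
        {
            SqlCommand receive = conn.CreateCommand();
            receive.Transaction = tx;
            receive.CommandText =
                @"WAITFOR (
                      RECEIVE TOP (1) conversation_handle, message_type_name, message_body
                      FROM dbo.OrderQueue
                  ), TIMEOUT 5000;";                            // 2. receive from the queue

            using (SqlDataReader reader = receive.ExecuteReader())
            {
                while (reader.Read())                           // 3. service business logic:
                    ProcessMessage(reader, conn, tx);           //    read/write data, SEND to
            }                                                   //    other services

            tx.Commit();                                        // 4. commit - the receive, data
        }                                                       //    changes and sends are atomic
    }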

When I sat down to marry SSB and WF, I naively assumed I could simply use WF for step three above. Alas, that turns out to be impossible. This thread on MSDN Forums has most of the gory details, but the short version is that WF does not support flowing host managed transactions into the workflow instance. As per Joel West in the aforementioned thread:

“[T]he WF runtime in V1 only supports flowing in a transaction on WorkflowInstance.Unload. There are various ways that you could try and hack this (with a custom persistence service or WorkflowCommitWorkBatchService) but if you do this it won’t work correctly 100% of the time and the times when it fails (error conditions or failures causing the tx to rollback) will be exactly when you are expecting transactional consistency.

Bottom line – the only way to make this work is to call WorkflowInstance.Unload inside your transaction scope.  This was the best that we could do in V1 to try and enable this pattern in some form.  Not always ideal but it can be made to work for most scenarios that require usage of an external transaction.”

So the WF-compatible execution thread looks like this (a code sketch follows the list):

  1. Start a transaction
  2. Receive message(s) from the top of the queue
  3. Load/Create the associated workflow instance for the received messages
    • All messages received are guaranteed to be from the same SSB conversation group, which is roughly analogous to a WF instance, so this turns out to be fairly easy
  4. Enqueue the received message(s) in the workflow instance
  5. Unload the workflow instance
  6. Commit the host transaction
  7. Reload the workflow instance
  8. Run the workflow instance (note, I’m using the manual scheduling service)
    • Workflow instance creates a transaction if needed
  9. Unload the workflow instance (typically done via UnloadOnIdle in the persistence service)
    • Assuming the workflow instance needed a transaction, it gets committed after unload
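In host code, the whole dance looks roughly like the sketch below. It assumes a WorkflowRuntime (runtime) configured with a ManualWorkflowSchedulerService and a SqlWorkflowPersistenceService set to unload on idle; ReceiveMessages (wrapping the SSB RECEIVE shown earlier), its Message type, OrderWorkflow and the “SsbMessages” workflow queue name are all hypothetical stand-ins.

    using System;
    using System.Collections.Generic;
    using System.Transactions;
    using System.Workflow.Runtime;
    using System.Workflow.Runtime.Hosting;

    Guid groupId;
    using (TransactionScope scope = new TransactionScope())           // 1. host transaction
    {
        List<Message> messages = ReceiveMessages(out groupId);        // 2. RECEIVE; groupId is the
                                                                      //    SSB conversation group
        WorkflowInstance instance;
        try
        {
            instance = runtime.GetWorkflow(groupId);                  // 3. load the existing
        }                                                             //    instance...
        catch (InvalidOperationException)
        {
            instance = runtime.CreateWorkflow(                        //    ...or create one, reusing
                typeof(OrderWorkflow), null, groupId);                //    the conversation group id
        }

        foreach (Message message in messages)
            instance.EnqueueItem("SsbMessages", message, null, null); // 4. enqueue the messages

        instance.Unload();                                            // 5. unload inside the scope
        scope.Complete();                                             // 6. commit host transaction
    }

    WorkflowInstance reloaded = runtime.GetWorkflow(groupId);         // 7. reload from persistence
    ManualWorkflowSchedulerService scheduler =
        runtime.GetService<ManualWorkflowSchedulerService>();
    scheduler.RunWorkflow(groupId);                                   // 8. run the instance
                                                                      // 9. on idle, the persistence
                                                                      //    service unloads it and the
                                                                      //    WF-owned transaction commits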

Basically, you use two transactions: one host managed transaction to move the message from SSB to the WF instance, and one WF managed transaction to process the message. The need for two transactions instead of one is unfortunate, but required given the current design of WF. And frankly, given the importance and difficulty of transaction management, I’m not that surprised that WF has hard coded transaction semantics. Trying to build a generic transaction flow model that would work in the myriad of scenarios WF is targeting would have been extremely difficult. At least there is a workaround, even if it means using two transactions and loading and unloading the workflow instance twice.

However, there is a silver lining to the two-transaction approach: two unexpected benefits when dealing with poison messages. First, SSB doesn’t have a dead letter queue like MSMQ does. Moving a poison message to a dead letter queue would break SSB’s exactly-once-in-order semantics (MSMQ doesn’t guarantee in-order delivery). But moving all messages into the WF instance gets them out of the main SSB queue, so poison messages don’t continue to get processed over and over.

Second, because the workflow instance is persisted after the messages are enqueued, there’s a representation of the workflow after the message is received but before the message is processed. If there’s a poison message, attempting to process the message will fail and roll back to this state. This persisted workflow instance could be sent to a developer, who could step through it to determine the cause of the error. We could even have developer versions of the runtime workflow services so we could read remote data and simulate data updates. I wouldn’t want the developer updating production data in this way, but it would be great for troubleshooting issues.

Slight Workflow Annoyance

One of the cool things about WF is that you can specify the GUID it uses to identify a workflow instance. WorkflowRuntime.CreateWorkflow has an overload (actually two) that lets you specify said workflow instance identifier. This is awesome for using WF with Service Broker, as Service Broker already has the idea of a conversation group, which is roughly analogous to a workflow instance. Conversation groups even use a GUID identifier, so there’s not even any mapping required to go from conversation group to workflow instance.

However, things get less cool when you call WorkflowRuntime.GetWorkflow. If you call GetWorkflow with a GUID that has no corresponding workflow instance, it throws an InvalidOperationException instead of just returning null. That seems like an odd choice. If you’re going to support specifying the instance identifier when you create the workflow instance, doesn’t it make sense that you should also gracefully support the scenario where an instance identifier is invalid?

I see two ways to deal with this:

  • Iterate through the list of loaded and persisted workflow instances looking for the one in question.
  • Call GetWorkflow and swallow the exception.

I ended up picking the “Swallow the Exception” approach, as I can’t imagine iterating through every loaded and persisted instance would be very performant. But swallowing exceptions always makes me feel icky. I’m a fan of the “exceptions only for exceptional situations” approach, and as far as I’m concerned, an invalid instance identifier isn’t that exceptional. Still, it’s a minor annoyance, especially given how cool it is to be able to specify the workflow instance identifier in the first place.
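At least the ick can be quarantined to a single helper. Here’s the shape of what I’m describing; TryGetWorkflow is my own name for it, not part of the WF API:

    // Returns the workflow instance for the given id, or null if the runtime
    // has no loaded or persisted instance with that identifier.
    public static WorkflowInstance TryGetWorkflow(WorkflowRuntime runtime, Guid instanceId)
    {
        try
        {
            return runtime.GetWorkflow(instanceId);
        }
        catch (InvalidOperationException)
        {
            return null;    // the one place the exception gets swallowed
        }
    }

It also makes the load-or-create step in the host loop read a little cleaner than an inline try/catch.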

The Two Types of Service Architects

Tomas Restrepo comments on my recent SSB and WCF posts:

Harry Pierson asks how well WCF supports long running tasks. He suggests that WCF does not support them very well, and says that’s one reason he likes SQL Server Service Broker so much. I’d say SSSB is a good match only as long as the long running tasks you’re going to be executing are purely database driven and can be executed completely within the database. Sure, this is an “expanded universe” with the CLR support in SQL Server 2005, but even so it makes me nervous at times 😄

You could also consider using a custom service with MSMQ or something like BizTalk Server for this if you had long running processes that were not completely tied to the DB (or a single DB for that matter).

Sam Gentile follows up:

In that same post, but I needed to call it out separate, Tomas rightfully says, “I’d say SSSB is a good match only as long as the long running tasks you’re going to be executing are purely database driven and can be executed completely within the database,” in response to Harry liking Service Broker so much. Talk about a narrow edge case. That’s way I never really got excited or cared about Service Broker. Its a narrow solution to a special edge case when everything is database driven and can be executed totally inside the database. That’s the old Microsoft Data-Driven Architecture for sure. Me, I’d rather have a rich Domain-Driven architecture most of the time. Then if you have Oracle databases in your architecture too, where does it leave you? Nowhere.

As you might expect, I have a few comments, clarifications and corrections.

First, Tomas’ statement that Service Broker only supports service logic “executed completely within the database” is flat-out wrong. Service Broker can be used from any environment that can connect to SQL Server and execute DML statements. If you can call SELECT/INSERT/UPDATE/DELETE, then you can also call BEGIN DIALOG/SEND/RECEIVE/END CONVERSATION. This includes Windows apps and services, web apps and services, console apps and even Java apps. Of course, you can also access Service Broker from stored procedures if you wish, but you’re not limited to them as Tomas suggested.
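To make that concrete, here’s a sketch of an ordinary .NET client sending a Service Broker message over a plain SqlConnection. The service, contract and message type names are invented for the example, and the usual CREATE MESSAGE TYPE/CONTRACT/QUEUE/SERVICE setup is assumed to already be in place:

    using System.Data.SqlClient;

    using (SqlConnection conn = new SqlConnection(connectionString))
    {
        conn.Open();
        SqlCommand cmd = conn.CreateCommand();
        cmd.CommandText =
            @"DECLARE @handle UNIQUEIDENTIFIER;
              BEGIN DIALOG CONVERSATION @handle
                  FROM SERVICE [//example/OrderClient]
                  TO SERVICE '//example/OrderService'
                  ON CONTRACT [//example/OrderContract];
              SEND ON CONVERSATION @handle
                  MESSAGE TYPE [//example/SubmitOrder] (@body);";
        cmd.Parameters.AddWithValue("@body", orderXml);
        cmd.ExecuteNonQuery();    // plain DML from outside the database
    }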

Tomas’ misconception may come from a feature of Service Broker called Activation, which dynamically scales message processing to match demand. For example, Service Broker can be configured to launch a new instance of a specified stored procedure if message processing isn’t keeping up with incoming traffic on a given queue. This is called internal activation, and because it uses stored procedures it does execute within the database, as Tomas said. But Service Broker also supports external activation, where it notifies an external application when activation is needed. You do have to build an application to host your service logic and handle these notifications, but that application doesn’t execute within the database. So while you could argue that it’s easier to execute your service logic within the database (no need to build a separate host app), it’s not required.
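For the curious, internal activation is just a couple of knobs on the queue. The queue and procedure names below are hypothetical, and while I’m showing the DDL executed from C# for consistency with the other sketches, you’d normally run it from a deployment script:

    SqlCommand configure = conn.CreateCommand();
    configure.CommandText =
        @"ALTER QUEUE dbo.OrderQueue WITH ACTIVATION (
              STATUS = ON,
              PROCEDURE_NAME = dbo.ProcessOrderQueue,  -- proc launched when messages back up
              MAX_QUEUE_READERS = 4,                   -- cap on concurrent activated readers
              EXECUTE AS OWNER);";
    configure.ExecuteNonQuery();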

Given that you don’t have to host your service logic in the database, you’re also not limited to “a single DB” as Tomas suggests. You don’t, in fact, have to put your Service Broker queues in the same database as your business data. So if you have Oracle in your environment, like the scenario Sam mentioned, you could host your service logic in an external application that processes messages from a queue in a SQL 2005 database while accessing and modifying business data in tables in the Oracle database. Using multiple databases does require distributed instead of local transactions, but if you’re using MSMQ as Tomas recommended, you’re already stuck with the DTC anyway.

Finally, I didn’t get Tomas’ “purely database driven” or Sam’s “everything is database driven” comments at all. While there are exceptions, the vast majority of systems I’ve ever seen/built/designed have essentially been one or more stateless tiers sitting in front of a stateful database. If it’s a traditional three-tier web app, there’s a stateless presentation tier, a stateless business logic tier and a stateless data access logic tier. For a web service, there’s no presentation tier, but there is the stateless SOAP processing tier typically provided by the web service stack. Does this mean the vast majority of web apps and services are “purely database driven” too? If so, then I guess it’s a good thing, right?

In the end, maybe there are two types of service architects – those that believe the majority of services will be atomic and those that believe the majority of services will be long running. For atomic services, Service Broker is overkill. But if it turns out that most services are long running, WCF’s lack of support is going to be a pretty big roadblock.

I’m obviously in the long running camp. I’m not sure, but I get the feeling this is the less popular camp, at least for now. We’ll have to wait and see, but what I do know is that whenever someone brings me what they think is an atomic business scenario, it doesn’t take much digging to reveal that the atomic scenario is actually a single step of a long running business scenario that also needs to be automated.

Here’s a question for Tomas, Sam and the rest of you: Which group do you self select into? Are most services going to be atomic or long running in the (pardon the pun) long run?

The Other Foundation Technology

I mentioned last week that WF “is one of two foundation technologies that my project absolutely depends on”. Sam Gentile assumes the other foundation technology is WCF. It’s not.

As a quick reminder, my day job these days is to architect and deliver shared service-oriented infrastructure for Microsoft’s IT division. These services will be automating long running business operations. And when I say long running, I mean days, weeks or longer. While there will surely be some atomic or stateless services, I expect most of the services we build will be long running. Thus, the infrastructure I’m responsible for has to enable and support long running services.

The other foundation technology my project depends on is Service Broker. Service Broker was expressly designed for building these types of long running services. It supports several capabilities that I consider absolutely critical for long running services:

  • Service Initiated Interaction. Polling for changes is inefficient. Long running operations need support for the Solicit-Response and/or Notification message exchange patterns.
  • Durable Messaging. The first fallacy of distributed computing is that the network is reliable. If you need to be 100% sure the message gets delivered, you have to write it to disk on both sides.
  • Service Instance Dehydration. It’s both dangerous and inefficient to keep an instance of a long running service in memory when it’s idle. In order to maximize integrity (i.e. service instances survive a system crash) as well as resource utilization (i.e. we’re not wasting memory/CPU/etc on idle service instances), service instances must be dehydrated to disk.

In addition to these capabilities, Service Broker supports something called Conversation Group Locking, which turns out to be important when building highly scalable long running services. Furthermore, my understanding is that Conversation Group Locking is a feature unique to Service Broker, not only across Microsoft’s products but across the industry. Basically, it means that inbound messages for a specific long running service instance are locked so they can’t be processed on more than one thread at a time.

Here’s an example: let’s say I’m processing a Cancel Order message for a specific order when the Ready to Ship message for that order arrives. With Conversation Group Locking, the Ready to Ship message stays locked in the queue until the Cancel Order transaction is complete, regardless of the number of service threads. Without Conversation Group Locking, the Ready to Ship message might get processed by another service thread at the same time the Cancel Order message is being processed. The customer might get notified that the cancellation succeeded while the shipping service gets notified to ship the product. Oops.
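The locking falls directly out of the RECEIVE semantics, so there’s nothing extra to code. Here’s a sketch of the pattern, reusing the hypothetical queue from earlier: once one transaction has a conversation group, concurrent RECEIVEs on other threads skip that group until the transaction completes.

    SqlCommand receive = conn.CreateCommand();
    receive.Transaction = tx;    // the group stays locked until this transaction completes
    receive.CommandText =
        @"DECLARE @cg UNIQUEIDENTIFIER;
          -- lock the next conversation group with messages available
          WAITFOR (GET CONVERSATION GROUP @cg FROM dbo.OrderQueue), TIMEOUT 5000;
          -- drain only that group's messages; a Ready to Ship for a locked order
          -- is invisible to other threads until this transaction commits
          RECEIVE conversation_group_id, message_type_name, message_body
          FROM dbo.OrderQueue
          WHERE conversation_group_id = @cg;";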

There’s also an almost-natural fit between Service Broker and Windows Workflow. For example, a Service Broker conversation group and a WorkflowInstance are roughly analogous. They even both use a Guid for identification, making the mapping between conversation group and WF instance simple and direct. I was able to get a prototype Service Broker / WF integration up and running in about a day. I’ll post more on that integration later this week.

Last but not least, Service Broker is wicked fast. Unfortunately, I don’t have any public benchmarks to point to, but the Service Broker team told me about a private customer benchmark that handled almost 9,000 messages per second! One of the reasons Service Broker is so fast is that it’s integrated into SQL Server 2005, which is pretty fast in its own right. Since Service Broker is baked right in, you can do all your messaging work and your data manipulation within the scope of a local transaction.

Service Broker has a few rough areas, and it lacks a supported managed API (though there is a sample managed API available). Probably the biggest issue is that Service Broker has almost no interop story. If you need to interop with a Service Broker service, you can use SQL Server’s native Web Service support or the BizTalk adapter for Service Broker from AdapterWORX. However, I’m not sure how many of Service Broker’s native capabilities are exposed if you use these interop mechanisms. You would probably have to write a bunch of application code to make these capabilities work in an interop scenario.

Still, I feel Service Broker’s unique set of capabilities, its natural fit with WF and its high performance make it the best choice for building my project’s long running services. Is it the best choice for your project? I have no idea. One of the benefits of working for MSIT is that I get to focus on solving a specific problem and not on solving general problems. I would say that if you’re doing exclusively atomic or stateless services, Service Broker is probably overkill. If you’re doing any long running services at all, I would at least give Service Broker a serious look.