First published on MSDN on Apr 20, 2010
This series of blog posts helps you design, develop, and debug a Resource DLL that gives your application high availability with Windows Server 2008 and 2008 R2 Failover Clustering.
We recommend you start with the other blog post in the series:
In our last post we showed the following RHS Resource state machine and described how to interpret the diagram. We will be referring to this same state machine in this blog post.
In this post, the final post in the series, we will walk through some scenarios using the above state chart diagram.
First, assume we have a resource that has no dependencies and is offline. The user tells the cluster to bring this resource online. RCM calls RHS, and RHS calls the Online entry point. From this point on, the resource is in the Onlining state, and specifically in the “Online in Progress” sub-state. By default, the resource is allowed to remain in this state for up to 5 minutes. The diagram below shows the workflow when the resource comes online successfully.
Please note that in this sample scenario the Resource DLL handles the online in a worker thread, which is not always required. You can find details on how to implement pending operations using a worker thread here http://msdn.microsoft.com/en-us/library/aa370471(VS.85).aspx .
If you are confident that online will complete within 5 minutes, you can take all the actions required to bring the application online within the Online call itself. Returning ERROR_SUCCESS from the Online call moves the RHS state machine directly to the Online state. Be careful when deciding whether to pend the Online or Offline call: if Online does not complete within 5 minutes, the result is an “Online Failure”, which increases your application's downtime. We suggest always pending Online and Offline unless the work is trivial and can complete in under 300 milliseconds.
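The choice between completing inline and pending can be sketched as follows. This is a portable Python simulation, not the real Resource API: the class `SketchResource`, its `bring_up` callable, and the state strings are hypothetical names used only for illustration. The only real values are the Win32 codes ERROR_SUCCESS (0) and ERROR_IO_PENDING (997).

```python
import threading

ERROR_SUCCESS = 0        # Win32 error code
ERROR_IO_PENDING = 997   # Win32 error code

class SketchResource:
    """Hypothetical stand-in for one resource instance (not the Resource API)."""

    def __init__(self, bring_up):
        self.bring_up = bring_up   # assumed callable that starts the application
        self.state = "Offline"
        self.worker = None

    def online(self, pend=True):
        self.state = "Online in Progress"
        if not pend:
            # Trivial work only: finish inside the call and report success.
            self.bring_up()
            self.state = "Online"
            return ERROR_SUCCESS
        # Otherwise hand the work to a worker thread and return right away.
        self.state = "Online Pending"
        self.worker = threading.Thread(target=self._online_worker, daemon=True)
        self.worker.start()
        return ERROR_IO_PENDING

    def _online_worker(self):
        try:
            self.bring_up()
            self.state = "Online"
        except Exception:
            self.state = "Failed"
```

In this sketch the caller gets control back immediately in the pending case, and the final state is reported by the worker once the application is actually up or has failed.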
Now let’s review how the scenario might change if the Online call to the Resource DLL completes in a different way.
The Offline call is handled very similarly to the Online call, since it can also be “pending”. Generally, all the statements we made above for the Online (and Online Pending) call also apply to the Offline (and Offline Pending) call.
While a resource is in the “Online in Progress” or “Offline in Progress” sub-state, RHS will not send resource controls to it. As soon as the resource moves to the “Online Pending” or “Offline Pending” sub-state, it starts receiving resource controls.
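As a compact restatement of that rule, here is a small Python lookup. The sub-state names come from the post; the table and the helper `accepts_resource_controls` are hypothetical, and only the four sub-states named above are modeled.

```python
# Whether RHS delivers resource controls in each sub-state named in the post.
CONTROLS_DELIVERED = {
    "Online in Progress": False,
    "Offline in Progress": False,
    "Online Pending": True,
    "Offline Pending": True,
}

def accepts_resource_controls(sub_state):
    """Look up one of the four modeled sub-states; others raise KeyError."""
    return CONTROLS_DELIVERED[sub_state]
```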
Please note that RHS handles Online and Offline calls that take too long differently from other calls. For Online and Offline calls, RHS notifies RCM, and RCM is expected to issue a Terminate call. For other calls that time out, RHS still notifies RCM, but then RHS terminates itself after creating a crash dump file (for more information, see the following blog post: http://blogs.msdn.com/clustering/archive/2009/06/27/9806160.aspx ). RCM will observe that RHS is gone and will take a recovery action.
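The two timeout paths can be summarized in a few lines of Python. This is only a restatement of the paragraph above as data: the function `on_call_timeout` and the step strings are hypothetical, and nothing here calls into RHS or RCM.

```python
def on_call_timeout(entry_point):
    """Return, in order, the recovery steps the post describes for a timed-out call."""
    steps = ["RHS notifies RCM"]
    if entry_point in ("Online", "Offline"):
        steps.append("RCM issues Terminate")
    else:
        steps.append("RHS creates a crash dump")
        steps.append("RHS terminates itself")
        steps.append("RCM observes RHS is gone and recovers")
    return steps
```

For example, `on_call_timeout("Offline")` yields the Terminate path, while `on_call_timeout("Terminate")` yields the path in which RHS exits.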
The next scenario refers to the diagram below:
This image shows the case where the user is trying to take a resource offline, and offline is taking too long because of a call that takes longer than 5 minutes. As soon as RHS detects the timeout, it changes the resource state to “Failed/Terminating” and reports that to RCM. In response, RCM issues a Terminate call back to RHS, and RHS calls the Resource DLL Terminate entry point. The Resource DLL takes an action to speed up completion of the Offline call. For instance, if the Resource DLL uses a socket to communicate with the application, then Terminate might try to close the socket. Terminate then waits for the Offline worker thread to complete and takes whatever action is needed to take the application offline. Please note that the Terminate call is also subject to the 5-minute timeout. So if the Offline worker thread does not complete in time, or if Terminate itself gets stuck, RHS will take recovery actions. This time RHS will terminate itself.
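The "unblock the worker, then wait for it" pattern in Terminate can be sketched in portable Python. This is a simulation under stated assumptions, not Resource API code: `OfflineWorker`, `slow_step`, and `wait_seconds` are hypothetical names, and a `threading.Event` stands in for an action such as closing the socket.

```python
import threading

class OfflineWorker:
    """Hypothetical model of a pending Offline plus a Terminate that unblocks it."""

    def __init__(self):
        self.cancel = threading.Event()   # stands in for "close the socket"
        self.thread = None

    def offline(self, slow_step):
        # slow_step(cancel) is an assumed callable that should return early
        # once cancel is set.
        self.thread = threading.Thread(
            target=slow_step, args=(self.cancel,), daemon=True)
        self.thread.start()

    def terminate(self, wait_seconds):
        """Unblock the worker, then wait for it with a bounded timeout."""
        self.cancel.set()
        if self.thread is not None:
            self.thread.join(wait_seconds)
            if self.thread.is_alive():
                # Worker is stuck; in the post, RHS would eventually
                # time out Terminate and terminate itself.
                return False
        return True
```

A cooperative worker such as `lambda cancel: cancel.wait(60)` returns as soon as `terminate` sets the event, so the wait is short. A worker that ignores the event leaves `terminate` returning `False`, which corresponds to the case where RHS has to recover on its own.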
As soon as the resource moves to the Online state, RHS starts monitoring its health. A resource can choose one of two health monitoring modes:
A Terminate call can run concurrently with any other call. Terminate is expected to perform the following tasks:
Let’s review the concurrency rules:
I hope this series of blog posts will be helpful when you design your own Resource DLL.
Senior Software Development Engineer
Clustering & High-Availability