STUN Service The OWLCS STUN Service implements STUN Simple

Deployment Topologies for Communication Services

In general, a temporary failure has a shorter duration. A typical example is an overload situation, in which case the server responds with 503 to some or all requests.

16.3.3.5 Failover Actions

The User Dispatcher can take several actions when it has detected a failure in a Presence server node. The goal of the action is to minimize the impact of the failure in terms of the number of failed subscriptions and publications and the time it takes to recover. In addition, the User Dispatcher needs to keep the distribution as even as possible over the active servers. The fail-over action used in this version of the User Dispatcher is to disable the node in the pool. This approach is better than removing the node because, when the ResizableBucketServerPool is used, the add and remove operations are not deterministic. This means that the result of adding a node depends on the sequence of earlier add and delete operations, whereas the disable operation always results in the same change in distribution given the set of active and disabled nodes.
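The property that disabling depends only on the set of active and disabled nodes can be illustrated with a minimal sketch. The class `DisablingPool` and its methods are hypothetical and are not the ResizableBucketServerPool API. The sketch uses rendezvous (highest-random-weight) hashing as a stand-in for whatever distribution the real pool applies.

```python
import hashlib


class DisablingPool:
    """Hypothetical pool: nodes are never removed, only marked disabled."""

    def __init__(self, nodes):
        self.nodes = list(nodes)   # full membership, unchanged by failures
        self.disabled = set()      # nodes currently failed over

    def disable(self, node):
        self.disabled.add(node)

    def enable(self, node):
        self.disabled.discard(node)

    @staticmethod
    def _weight(user, node):
        # Stable hash, so every dispatcher instance computes the same weight.
        digest = hashlib.sha256(f"{user}|{node}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def lookup(self, user):
        # Only the current active set matters, not the history of operations.
        active = [n for n in self.nodes if n not in self.disabled]
        if not active:
            raise RuntimeError("no active nodes in pool")
        return max(active, key=lambda n: self._weight(user, n))
```

In this sketch, two dispatcher instances that reach the same set of disabled nodes through different sequences of disable and enable calls map every user to the same node, and users whose node is still active are not moved.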

16.3.3.6 Overload Policy

An activated overload policy can indicate several types of failures, but its main purpose is to protect the system from a traffic load that is too big for it to handle. If such a situation is treated as a failure, fail-over actions can bring down the whole cluster: if the distribution of traffic is fairly even, all the nodes will be in or near an overloaded situation. If the dispatchers remove one node from the cluster and redistribute that node’s traffic over the remaining nodes, those nodes will certainly enter an overload situation as well, which causes a chain reaction. Since it is difficult to distinguish this overload situation from a software failure that triggers the overload policy even though the system is not under load, it might still be better to take the fail-over action unless Overload Policy is disabled. If the system is really in an overload situation, it is probably under-dimensioned, and then the fail-over should be disabled. The User Dispatcher will not fail over when it has detected a 503 response, which indicates that the overload policy is activated. However, if a server is in the highest overload policy state, where it drops messages instead of responding 503, the User Dispatcher monitor will receive an internal 408, which cannot be distinguished from a dead server, and failover will occur.
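The decision rule described above can be sketched as a small function. The names `Action` and `action_for_response` are hypothetical and are not part of the User Dispatcher. Modeling a dropped message as `None` and treating all other status codes as requiring no action are assumptions made for illustration.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    FAIL_OVER = "fail_over"


def action_for_response(status):
    """Map a monitored response status to a fail-over decision.

    `status` is the SIP status code seen by the monitor, or None when the
    server dropped the message and no response arrived (assumption: the
    monitor then generates an internal 408).
    """
    if status is None:
        status = 408  # internally generated timeout
    if status == 503:
        # Overload policy activated: failing over could overload the rest.
        return Action.NONE
    if status == 408:
        # Cannot be distinguished from a dead server.
        return Action.FAIL_OVER
    # Assumption: any other response is treated as a live server.
    return Action.NONE
```

With this rule, a server that still answers 503 stays in the pool, while a server that silently drops messages is handled like a dead one.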

16.3.3.7 Synchronization of Failover Events

Depending on the failure detection mechanism, there may be a need to synchronize the fail-over events, or the resulting state, between the different dispatcher instances. This is required if the detection mechanism, such as an Error Response, is not guaranteed to be consistent across the cluster. For instance, one server node may send a 503 response on one request but work fine after that, which can be due to a glitch in the overload policy. If only one 503 was sent, then only one dispatcher instance will receive it, and if that event triggers a fail-over, that dispatcher instance will be out of sync with the rest of the cluster. Further, even if a grace period is implemented so that it takes several 503 responses over a time period to trigger the fail-over, there is still a risk of a race condition if the failure duration is the same as the grace period. The following methods can be used to ensure that the state after fail-over is synchronized across the cluster of dispatcher instances:

16.3.3.7.1 Broadcasting Fail-Over Events

In this approach each dispatcher instance has to send a notification to all other instances, typically using JGroups or some other multicast technique, when it has decided to take a fail-over action and change the set