
If the Presence server is collocated with a User Dispatcher, each such callback would be within the same JVM.

16.3.3.2.2 Presence Application Broadcast

Another solution is to have the Presence servers guarantee that a user only exists on one Presence node at any given time. This can be done by having the Presence application broadcast a message to all its neighbors when it receives a PUBLISH or SUBSCRIBE for a new presentity (a presentity that it does not already have state for). If another Presence node that receives this broadcast message already has active subscriptions for this presentity, that server must terminate those subscriptions so that the client can establish a new subscription with the new server. With this functionality in the Presence application, the User Dispatcher would not have to perform additional steps to migrate a user from one live node to another.
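The following sketch illustrates this claim-and-terminate flow in plain Java. The PresentityClaimHandler, ClusterChannel, and Subscription names are illustrative abstractions, not OWLCS APIs.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PresentityClaimHandler {

    /** Local presence state, keyed by presentity URI. */
    private final Map<String, Subscription> localState = new ConcurrentHashMap<>();
    private final ClusterChannel channel;          // broadcasts to all neighbor nodes

    public PresentityClaimHandler(ClusterChannel channel) {
        this.channel = channel;
    }

    /** Called when a PUBLISH or SUBSCRIBE arrives for a presentity. */
    public void onPublishOrSubscribe(String presentityUri) {
        // Only broadcast when this node sees the presentity for the first time.
        if (!localState.containsKey(presentityUri)) {
            channel.broadcastClaim(presentityUri);
        }
    }

    /** Called when another node broadcasts a claim for a presentity. */
    public void onClaimReceived(String presentityUri) {
        Subscription existing = localState.remove(presentityUri);
        if (existing != null) {
            // Terminate the local subscription so the client re-subscribes
            // against the node that now owns the presentity.
            existing.terminate();
        }
    }

    interface ClusterChannel { void broadcastClaim(String presentityUri); }
    interface Subscription   { void terminate(); }
}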

16.3.3.3 Standby Server Pool

Another approach is to have a standby pool of servers that are idle, ready to take over traffic from a failing node. When an active node fails, the User Dispatcher will redistribute all of its traffic to one server from the standby pool. This node will then become active, and when the failing node is eventually repaired it will be added to the standby pool. This eliminates the need to migrate users “back” from a live node when a failed node resumes. This approach requires more hardware, and the utilization of hardware resources will not be optimal.
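A simple bookkeeping sketch of this scheme is shown below. The class and method names are assumptions for illustration and are not part of the User Dispatcher.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class StandbyServerPool {

    private final List<String> activeNodes = new ArrayList<>();
    private final Deque<String> standbyNodes = new ArrayDeque<>();

    public StandbyServerPool(List<String> active, List<String> standby) {
        activeNodes.addAll(active);
        standbyNodes.addAll(standby);
    }

    /** A failing active node is replaced by one node from the standby pool. */
    public synchronized void onNodeFailed(String failedNode) {
        if (activeNodes.remove(failedNode) && !standbyNodes.isEmpty()) {
            // The standby node takes over all traffic of the failed node,
            // so no users have to be migrated "back" later.
            activeNodes.add(standbyNodes.poll());
        }
    }

    /** A repaired node does not rejoin the active set; it becomes a standby. */
    public synchronized void onNodeRepaired(String repairedNode) {
        standbyNodes.add(repairedNode);
    }

    public synchronized List<String> getActiveNodes() {
        return new ArrayList<>(activeNodes);
    }
}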

16.3.3.4 Failure Types

There are several types of failures that can occur in a Presence server, and different types of failures may require different actions from the User Dispatcher.

16.3.3.4.1 Fatal Failures

If the failure is fatal, all state information is lost and established sessions will fail. However, depending on the failure response, subscriptions (presence subscribe sessions) can survive using a new SIP dialog. If the response code is a 481, the presence client must, according to RFC 3265, establish a new SUBSCRIBE dialog, and this is not considered a failure from a presence perspective. All other failure responses may, depending on the client implementation, be handled as an error by the client and should therefore be considered a failure. After a fatal failure the server does not have any dialog state from before the failure, which means that all subsequent requests that arrive at this point will receive a 481 response. During the failure period all transactions, both initial and subsequent, will be terminated with a non-481 error code, most likely a 500, or an internal 503 or 408, depending on whether there is a proxy in the route path and on the nature of the failure. Typically a fatal failure will result in the server process or the entire machine being restarted.
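The client-side behavior implied above can be sketched as follows: a 481 response simply triggers a fresh SUBSCRIBE dialog (per RFC 3265) and is not counted as a failure, while other error responses are. The PresenceClient type and its methods are assumptions for illustration.

public class SubscribeResponseHandler {

    private final PresenceClient client;

    public SubscribeResponseHandler(PresenceClient client) {
        this.client = client;
    }

    public void onResponse(String presentityUri, int statusCode) {
        if (statusCode < 300) {
            return;                               // success, nothing to do
        }
        if (statusCode == 481) {
            // Dialog state was lost on the server (fatal failure); re-establish
            // the subscription in a new dialog instead of reporting an error.
            client.subscribe(presentityUri);
        } else {
            // 500, 503, 408 and so on during the failure period: treat as a failure.
            client.reportSubscriptionFailure(presentityUri, statusCode);
        }
    }

    interface PresenceClient {
        void subscribe(String presentityUri);
        void reportSubscriptionFailure(String presentityUri, int statusCode);
    }
}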

16.3.3.4.2 Temporary Failures

A temporary failure is one where little or no data is lost, so that session state remains in the server after the failure. This means that a subsequent request that arrives after the server has recovered from the failure will be processed with the same result as it would have been before the failure. All requests that arrive during the failure period will be answered with a non-481 failure response, such as a 503. In general a temporary failure has a shorter duration; a typical example is an overload situation, in which case the server will respond with 503 to some or all requests.
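Because session state survives a temporary failure, a request that was answered with a 503 during the failure period can simply be retried once the server has recovered. A minimal, illustrative retry helper (not part of OWLCS) might look as follows.

public class TemporaryFailureRetry {

    interface SipCall { int send() throws Exception; }   // returns the SIP status code

    /** Retries a request a few times while the server answers 503. */
    public static int sendWithRetry(SipCall call, int maxAttempts, long waitMillis)
            throws Exception {
        int status = 0;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            status = call.send();
            if (status != 503) {
                return status;          // success, or a non-temporary failure code
            }
            Thread.sleep(waitMillis);   // wait out the (short) failure period
        }
        return status;
    }
}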

16.3.3.5 Failover Actions

The User Dispatcher can take several actions when it has detected a failure in a Presence server node. The goal of the action is to minimize the impact of the failure in terms of the number of failed subscriptions and publications and the time it takes to recover. In addition, the User Dispatcher needs to keep the distribution as even as possible over the active servers. The fail-over action used in this version of the User Dispatcher is to disable the node in the pool. This approach is better than removing the node when the ResizableBucketServerPool is used, because the add and remove operations are not deterministic: the result of adding a node depends on the sequence of earlier add and delete operations, whereas the disable operation will always result in the same change in distribution given the set of active and disabled nodes.
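The following sketch illustrates why disabling is deterministic: with the full, ordered node list kept intact, the mapping from user to node depends only on which nodes are currently disabled, not on the order of earlier operations. This is an illustration only, not the ResizableBucketServerPool implementation.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DisableAwarePool {

    private final List<String> configuredNodes;     // fixed, ordered node list
    private final Set<String> disabledNodes = new HashSet<>();

    public DisableAwarePool(List<String> configuredNodes) {
        this.configuredNodes = new ArrayList<>(configuredNodes);
    }

    public void disable(String node) { disabledNodes.add(node); }
    public void enable(String node)  { disabledNodes.remove(node); }

    /** Deterministically maps a user to an enabled node. */
    public String selectNode(String userUri) {
        int start = Math.floorMod(userUri.hashCode(), configuredNodes.size());
        // Walk the fixed node order from the user's "home" position until an
        // enabled node is found; only the disabled set influences the result.
        for (int i = 0; i < configuredNodes.size(); i++) {
            String candidate = configuredNodes.get((start + i) % configuredNodes.size());
            if (!disabledNodes.contains(candidate)) {
                return candidate;
            }
        }
        throw new IllegalStateException("all nodes are disabled");
    }
}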

16.3.3.6 Overload Policy

An activated overload policy can indicate several types of failures, but its main purpose is to protect the system from a traffic load that is too big for it to handle. If such a situation is detected as a failure, fail-over actions can bring down the whole cluster: if the distribution of traffic is fairly even, all the nodes will be in or near an overload situation, and if the dispatchers remove one node from the cluster and redistribute that node’s traffic over the remaining nodes, those nodes will certainly enter an overload situation, causing a chain reaction. It is difficult to distinguish this overload situation from a software failure that triggers the overload policy even though the system is not under load, so it might still be better to take the fail-over action unless the Overload Policy is disabled. If the system really is in an overload situation, it is probably under-dimensioned, and fail-over should then be disabled. The User Dispatcher will not fail over when it has detected a 503 response, which indicates that the overload policy has been activated. However, if a server is in the highest overload policy state, where it drops messages instead of responding with 503, the User Dispatcher monitor will receive an internal 408, which can never be distinguished from a dead server, and fail-over will occur.
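The monitor decision described above can be sketched as follows: a 503 from the overload policy does not trigger fail-over, while an internal 408 (no response at all) cannot be distinguished from a dead server and therefore does. The class and method names are illustrative, not the actual User Dispatcher monitor API.

public class OverloadAwareFailureDetector {

    public enum Action { KEEP_NODE, FAIL_OVER }

    /**
     * @param statusCode response code seen by the monitor, or 408 if the
     *                   transaction timed out internally (no response at all)
     */
    public Action onMonitorResult(int statusCode) {
        if (statusCode == 503) {
            // Overload policy is active: failing over would shift this node's
            // traffic onto other, probably equally loaded, nodes and could
            // start a chain reaction that brings down the whole cluster.
            return Action.KEEP_NODE;
        }
        if (statusCode == 408) {
            // Messages are being dropped (highest overload policy state) or the
            // server is dead; the two cannot be distinguished, so fail over.
            return Action.FAIL_OVER;
        }
        // Handling of other response codes is outside the scope of this sketch.
        return Action.KEEP_NODE;
    }
}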

16.3.3.7 Synchronization of Failover Events

Depending on the failure detection mechanism, there may be a need to synchronize the fail-over events, or the resulting state, between the different dispatcher instances. This is required if the detection mechanism is not guaranteed to be consistent across the cluster, such as detection based on an error response. For instance, one server node may send a 503 response to one request but work fine after that; this can be due to a glitch in the overload policy. If only one 503 was sent, then only one dispatcher instance will receive it, and if that event triggers a fail-over, that dispatcher instance will be out of sync with the rest of the cluster. Further, even if a grace period is implemented so that it takes several 503 responses over a period of time to trigger the fail-over, there is still a risk of a race condition if the failure duration is the same as the grace period. The following methods can be used to ensure that the state after a fail-over is synchronized across the cluster of dispatcher instances:

16.3.3.7.1 Broadcasting Fail-Over Events

In this approach each dispatcher instance has to send a notification to all other instances, typically using JGroups or some other multicast technique, when it has decided to take a fail-over action and change the set of active servers.
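A minimal sketch of broadcasting such an event with JGroups is shown below. It assumes a JGroups 4.x style API (exact classes differ between versions), and the event payload and its handling are purely illustrative.

import org.jgroups.JChannel;
import org.jgroups.Message;
import org.jgroups.ReceiverAdapter;

public class FailoverEventBroadcaster extends ReceiverAdapter implements AutoCloseable {

    private final JChannel channel;

    public FailoverEventBroadcaster(String clusterName) throws Exception {
        channel = new JChannel();          // default protocol stack
        channel.setReceiver(this);
        channel.connect(clusterName);      // all dispatcher instances join this group
    }

    /** Called by the local dispatcher when it decides to disable a node. */
    public void broadcastNodeDisabled(String nodeId) throws Exception {
        // A null destination means the message goes to every group member.
        channel.send(new Message(null, "DISABLE:" + nodeId));
    }

    /** Called by JGroups when another instance broadcasts an event. */
    @Override
    public void receive(Message msg) {
        String event = (String) msg.getObject();
        if (event.startsWith("DISABLE:")) {
            String nodeId = event.substring("DISABLE:".length());
            // Apply the same fail-over action locally so that all dispatcher
            // instances end up with a consistent set of active servers.
            System.out.println("Disabling node " + nodeId);
        }
    }

    @Override
    public void close() {
        channel.close();
    }
}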