Storing Long-Lived Call State Data in an RDBMS

Deployment Topologies for Communication Services

The User Dispatcher will not perform any fail-over; it keeps sending traffic to the failing node. In this case no sessions are migrated to another node, since all PUBLISH and initial SUBSCRIBE requests are still sent to the failing node. Initial SUBSCRIBE requests that arrive during the failure period will fail with a non-481 error, most likely 503. It is up to the client to try to set up a new subscription when the failing one expires, or to report a failure. All PUBLISH requests and initial SUBSCRIBE requests will generate a failure. When the failing node resumes normal operation, all traffic will be processed again and no requests should fail. The time it takes until all presence states are “correct” again will be minimal, since no sessions were failed over.

If the monitoring feature is implemented in a way that detects the node as “down” in this case, then some users will be migrated to another node, and when the node comes back they will be migrated back again. This will generate increased load for a period of time. If the overload policy was activated because the traffic load was too high, this migration is harmful: the overload will most likely happen again, and the other servers will most likely also be close to overload. This could lead to a chain reaction in which the whole cluster goes down, resulting in a complete loss of service.
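The chain-reaction risk can be illustrated with a toy model. The following Python sketch is not part of OWLCS: the function name `simulate_migration`, the even redistribution of a node's load across its peers, and the single shared `capacity` threshold are all assumptions made for illustration only.

```python
from typing import Dict, List


def simulate_migration(load: Dict[str, float], capacity: float,
                       first_down: str) -> List[str]:
    """Return the order in which nodes are marked down.

    Toy model (an assumption, not OWLCS behavior): when a node is marked
    down, its load is split evenly across the nodes that are still up.
    Any node whose load then exceeds `capacity` is marked down in turn.
    """
    load = dict(load)            # work on a copy, leave the caller's dict alone
    down: List[str] = []
    pending = [first_down]
    while pending:
        node = pending.pop(0)
        if node in down:
            continue
        down.append(node)
        moved = load.pop(node)   # load that has to be migrated elsewhere
        if not load:
            break                # no node left to take the traffic
        share = moved / len(load)
        for peer in load:
            load[peer] += share
        # Any peer pushed over capacity becomes the next failure.
        for peer in sorted(load):
            if load[peer] > capacity and peer not in pending:
                pending.append(peer)
    return down
```

In this model, four nodes with loads of 105, 90, 90 and 90 against a capacity of 100 all go down one after another once the first node is migrated, whereas peers running at 50 absorb the migrated load and only the first node is marked down.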

16.3.3.9.2 One Presence Server Overloaded Multiple Times for Five Seconds

This use case describes a Presence server that goes in and out of overload for short periods, such as 5 seconds. This is common if the system is under-dimensioned and can barely cope with the traffic load, but it could also be caused by some other disturbance that affects only that particular node. The User Dispatcher will behave exactly as in One Presence Server Overloaded for 60 Seconds, and the result will be the same, except that the number of failed sessions and failed-over sessions will be smaller due to the shorter failure period.

16.3.3.9.3 Overload Policy Triggered by an OWLCS Software Failure

A failure in the OWLCS software, or in an application deployed on top of it, causes all threads to be locked (deadlock). Eventually the in queue fills up and the overload policy is activated, even though the system is not actually overloaded. This is a permanent error that can only be resolved by restarting the server. Depending on whether and how the monitor function is implemented, the number of affected users can be minimized. However, this situation cannot be distinguished from a “real” overload situation, in which case a fail-over may not be the best thing to do.

16.3.3.9.4 A Presence Server Hardware Failure

The cluster consists of four Presence servers, each node consisting of one OWLCS instance with a User Dispatcher and a Presence application deployed. 100,000 users are distributed evenly over the four servers, 25,000 on each node. One of the Presence servers crashes due to a hardware failure. A manual operation is required to replace the broken server with a new one, and the server is up and running again only after two hours. Depending on the type of failure, the response code sent back on transactions proxied to the failed node will be 408 or 503. In this case all sessions on this node will fail, since the failure duration is most likely longer than the expiration time of the subscriptions. If a monitor server is implemented with fail-over, then the failure time will be reduced to the detection time (seconds). The users will be migrated by the migration feature, which will create increased load for a period of time. Because the User Dispatcher was also running on the failed node, all the persisted data for the User Dispatcher will be lost when the server is replaced with a new machine.
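The figures in this use case can be checked with a short calculation. The sketch below is illustrative only: the helper names `even_split` and `all_sessions_expire` are hypothetical, migrated users are assumed to be spread evenly over the surviving nodes, and the 3600-second subscription expiry is an assumed value, since the text does not state one.

```python
from typing import List


def even_split(total_users: int, nodes: int) -> List[int]:
    """Distribute users as evenly as possible over `nodes` servers."""
    if nodes <= 0:
        raise ValueError("at least one node is required")
    base, extra = divmod(total_users, nodes)
    # The first `extra` nodes each take one additional user.
    return [base + 1 if i < extra else base for i in range(nodes)]


def all_sessions_expire(outage_seconds: int, expiry_seconds: int) -> bool:
    """True if the outage lasts at least one full subscription lifetime.

    In that case no subscription on the failed node can be refreshed
    before it expires.
    """
    return outage_seconds >= expiry_seconds


before = even_split(100_000, 4)   # four healthy nodes
after = even_split(100_000, 3)    # one node lost, its users migrated

# Hypothetical expiry of 3600 s against the two-hour (7200 s) outage.
lost = all_sessions_expire(2 * 60 * 60, 3600)
```

Under these assumptions each surviving node goes from 25,000 users to about 33,333, an increase of roughly one third, which is the extra load the migration feature places on the remaining servers.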

16.3.3.9.5 Expanding the Cluster with One Presence Node

The cluster consists of 3 Presence servers, each node consisting of one OWLCS instance with a User Dispatcher and a