Friday, April 16, 2010

The dreaded so-called "geographical failover" and SIP

This interesting requirement just keeps popping-up in the middleware world and is a topic of many threads in developer discussions as we are designing our HA architecture to be adequate and performant. The problem stems from a common deployment architecture where you have two or more data-centers in distant geographical locations (think WAN distance, where LAN is not possible) . Each data-center site has the same service deployed, but is responsible for serving only the requests from their geographical area. All data centers are independently fault-tolerant, but if one site goes down for any reason, this would cause service outage in the area it is supposed to cover.

Geographical failover usually refers to being able to temporary redirect the requests to other more distant data-center sites when a local total data-center failure occurs. Even if it is slower or more expensive, it is better than a completely unavailable service.

This problem is common for HTTP and SIP, and the strategies to solve it are very similar. The challenge is in two aspects:
  • Load balancing. Obviously, you need to make sure your local load balancers survive the data center failure. There are many ways this can done, but commonly you can just host the load balancers somewhere close to the users and keep it isolated from the data center. You should also make sure there is enough backup power and network infrastructure to keep them alive for a few hours or days.
    In Mobicents and JBCP, we introduced pluggable algorithms for the SIP/HTTP converged load balancer, and it is not hard at all to come up with a working solution. For instance, the load balancer algorithm may have two lists of server nodes - local nodes and distant nodes. If and only if all local application server nodes are dead, then the load balancer can start routing to the distant nodes. Plain and simple.
    If you are using a distributed load balancer (shown in the figure on right), one per site, it will behave the same way, and you may have several additional options depending on the IP load balancer capabilities.

    As long as the SIP load balancers are aware of the topology, the IP load balancer doesn't have to be aware of the other data-centers. It might be possible to instruct the IP load balancer to route requests to the backup data-center site, but the mechanism doesn't depend on it.
  • State Replication (a.k.a What happens with the ongoing calls after the failure?). In the case of complete data-center failure, the whole LAN cluster will go down. Let's first take a look at the case where Mobicents nodes only replicate state within the data-center network.
    The ongoing calls are likely to have some state associated with them and this state will be lost without replication. Unlike HTTP, in SIP the containers generally do not deliver requests to the applications if their dialog state doesn't exist in the protocol stack. SIP has strict rules about the state machines in the dialogs and transactions, thus the overall consistency depends on that state. All SIP requests sent after the failure would not succeed and the calls are technically lost. But, let's look at what exactly happens with lost calls in the following cases:
    • Simple media calls, audio or video calls - once the SDP exchange is complete the users can hear each-other and have a normal call without any SIP messages. So the failure in the SIP server will not affect them at all. As long as the Media Server is alive or the Media is flowing directly between the User Agents, the call will be fine. When the users hang-up, the BYE is lost, but it won't matter any more, because the call is over. Unless you need to capture the BYE and do some logic, you will be fine after the failure.
    • SIP-intensive applications - presence, chat, some IVR applications and so on. In these cases the application will fail due to the lost state.
Being aware of the limitations, lets see what are the options of for state replication between data-centers.

First, you have to realize that if you had enough network capacity to replicate the call state to distant nodes, you could just organize all nodes in small JBoss clusters where the nodes are from different geographical areas. JBoss already has a lot of settings in JBoss Cache/Infinispan and JGroups to fine-tune the connections, even if there is no dedicated feature for geography.

Thus, let's assume you don't have enough bandwidth for that. At this point it is very-application specific, because you must decide what part of the state you can lose. There are several options:
  • Partial replication per call between the geographical areas - lose some bits from each call
  • Rare replication - do the full replication, but only in steady-state calls when changes are unlikely to occur.
  • Priority calls - pick high priority calls and only replicate them.
  • No replication - invest in reducing the risk of complete data center failure. How hard is that? If the data-center is down, but your users are still online, this implies that there is massive working infrastructure somewhere. Moreover, power and network backup resources are available cheaply and they scale well. In fact, I can think of many reasons why the geographical replication methods mentioned above are not a good idea:
    • There may be a noticeable memory impact. In most implementations the local and the geographically replicated call state resides in different memory structures (so double the memory).
    • The performance impact may be huge - first because of the additional replication protocol logic in each node and second because your data centers must run at lower utilization to be ready to take the extra load in case of data-center failure.
    • Incidentally, the long distance network bandwidth is the most expensive. So you will have to pay for that as well.
    • Investing in hardware fault-tolerant resources for your data-center makes the reliability independent of the software. In other words - even cluster-unaware applications will benefit and will be more reliable.
    • Even if you recover some SIP state and your SIP applications don't fail, you still need to think how to keep the Media Servers alive or to switch them over to the other data-center.
    • Partial replication in general will still produce a lot of lost calls. It is just on best-effort basis.
    • Application are hard to code and test in partially replicated environment. It is an explosion of cases to be considered and tested when developing.
In summary, the geographically redundant load balancing is justified and easy to achieve at reasonable price. However state replication between data-centers is too expensive. It is much more efficient to focus on reducing the probablity of data-center failure, which is low anyways. Unless you have some service-specific condition that really fits the model, there is no point in state replication between data-centers.