[VoiceOps] Geographic redundancy

Aug. 13, 2009

It works in that (a) BroadWorks can synchronize between servers at
different sites; (b) when either of the sites fails, service is
restored automatically and quickly after the fault; and (c) when the
fault is corrected, the BroadWorks servers re-synchronize gracefully
without interruption to subscribers.

But the hard part is not the BroadWorks servers -- it's the access  
network.

In most cases, geographic distribution of the VoIP servers
(BroadWorks application servers, in this discussion) means that you
also have geographically distributed SBCs. And such SBCs are usually
not call-state synchronized.

And typically that means that each SBC has a separate IP address  
facing subscribers.

And most subscribers only have a single link to the access network.

The customer's device has to handle this properly, and the SBC has to  
handle this properly.

For example, when a failure of one site occurs, you want all of your  
devices to re-register through the secondary site. How long will it  
take before they re-register? How will they detect that one site has  
failed? What happens to calls headed toward the subscriber during the  
period until the customer re-registers?
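
To put rough numbers on the "how long" question, here is a minimal
sketch. It assumes the device has no keepalive and only notices the
failure when its next REGISTER refresh times out; the function name
and the example intervals are mine, purely for illustration. The 32
second figure is the RFC 3261 default for Timer F (64 * T1).

```python
# Sketch only: interval values are illustrative assumptions, not measurements.

# RFC 3261 default non-INVITE transaction timeout (Timer F = 64 * T1, T1 = 500 ms).
SIP_TIMER_F_S = 32.0


def worst_case_unreachable_s(refresh_interval_s,
                             transaction_timeout_s=SIP_TIMER_F_S,
                             secondary_register_s=0.0):
    """Worst-case time a subscriber is unreachable after a site failure.

    Worst case: the site dies just after a successful refresh, so the
    device waits a full refresh interval, then waits for the REGISTER
    to the dead site to time out, then registers at the secondary site.
    """
    return refresh_interval_s + transaction_timeout_s + secondary_register_s


if __name__ == "__main__":
    for interval in (3600, 300, 60):
        print(interval, worst_case_unreachable_s(interval))
```

With an hourly refresh that is over an hour of missed inbound calls;
shortening the refresh shrinks the window but multiplies the load on
the SBC.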

For the SBC, how will it handle the mass re-registration from  
subscribers moving over from the other SBC? How will it protect the  
core registrar (BroadWorks AS in our example) against attack?
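
One common way to keep a re-registration storm off the registrar is
to rate-limit REGISTERs at the SBC. The following is only a sketch of
a token bucket, not how any particular SBC implements it; the class
name, rate, and burst values are hypothetical.

```python
class RegisterLimiter:
    """Token bucket limiting how many REGISTERs are forwarded to the core.

    Hypothetical illustration: a real SBC exposes this as configuration,
    not as code, and what it does with the excess (drop, or reply 503
    with Retry-After) varies by product.
    """

    def __init__(self, rate_per_s, burst):
        self.rate = float(rate_per_s)    # tokens added per second
        self.capacity = float(burst)     # maximum tokens held
        self.tokens = float(burst)       # start with a full bucket
        self.last = None                 # time of the previous call

    def allow(self, now):
        """Return True if a REGISTER arriving at time `now` may be forwarded."""
        if self.last is not None:
            elapsed = max(0.0, now - self.last)
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    limiter = RegisterLimiter(rate_per_s=2, burst=5)
    # Six REGISTERs arrive at the same instant; only the burst gets through.
    print([limiter.allow(0.0) for _ in range(6)])
```

The cost of the limiter is that throttled subscribers stay
unregistered longer, which feeds back into the first set of questions.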

Nathan, "Zero" impact on customers is incredibly expensive to achieve.
You can, in fact, engineer capacity to make this switching (and even
route flapping) graceful, but it means you have orders-of-magnitude
more expense in your access network. An SBC that can handle 10,000
subscribers today might only be able to handle 100 subscribers if we
need to ensure zero new calls are lost, because each of those
subscribers has to hammer away at the SBC doing polling. It's probably
less costly to bring each subscriber to each of your two sites, and
then to put more failover at the customer premises (like a smart ALG).
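
The 10,000 versus 100 comparison is just proportionality. As a sketch,
assume the SBC can process a fixed rate of polling messages; then the
number of subscribers it can carry scales directly with the polling
interval. The 3600 and 36 second intervals below are made-up values
chosen to show a 100x change, not recommended settings.

```python
def subscribers_supported(baseline_subscribers, baseline_interval_s,
                          polling_interval_s):
    """Subscribers an SBC can carry at a new polling interval.

    Assumes a fixed message-processing budget, so capacity is
    proportional to the interval between polls (integer arithmetic).
    """
    return baseline_subscribers * polling_interval_s // baseline_interval_s


if __name__ == "__main__":
    # Hypothetical: 10,000 subscribers refreshing hourly, then every 36 s.
    print(subscribers_supported(10_000, 3600, 3600))
    print(subscribers_supported(10_000, 3600, 36))
```

Polling 100 times as often means the same box carries one hundredth
of the subscribers.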

Nevertheless, without call-state synchronization in the SBCs, it may
not be possible to achieve full site-to-site failover with Nathan's
zero effect on customers. For example: if a call started on SBC-site-A
and then fails over to SBC-site-B, SBC-site-B would normally reject
re-INVITEs for that dialog; therefore session audits, call hold/resume,
etc. can cause the standing call to drop.
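
A toy model of why that happens, assuming each SBC keeps its own
dialog table and nothing is replicated between sites. The class and
method names are hypothetical; the 481 (Call/Transaction Does Not
Exist) response for an in-dialog request that matches no known dialog
is from RFC 3261.

```python
class SbcDialogTable:
    """Per-site dialog state, with no replication to the other site."""

    def __init__(self, name):
        self.name = name
        self.dialogs = set()

    @staticmethod
    def _key(call_id, from_tag, to_tag):
        # A dialog is identified by Call-ID plus both tags; a frozenset
        # makes the key independent of which side sends the request.
        return (call_id, frozenset((from_tag, to_tag)))

    def call_established(self, call_id, from_tag, to_tag):
        self.dialogs.add(self._key(call_id, from_tag, to_tag))

    def handle_reinvite(self, call_id, from_tag, to_tag):
        """Return the SIP status code for an in-dialog re-INVITE."""
        known = self._key(call_id, from_tag, to_tag) in self.dialogs
        return 200 if known else 481


if __name__ == "__main__":
    site_a = SbcDialogTable("SBC-site-A")
    site_b = SbcDialogTable("SBC-site-B")
    site_a.call_established("call-1", "tag-caller", "tag-callee")
    print(site_a.handle_reinvite("call-1", "tag-caller", "tag-callee"))
    # After failover the same re-INVITE lands on site B, which has no state.
    print(site_b.handle_reinvite("call-1", "tag-callee", "tag-caller"))
```

Most endpoints tear the call down when a session refresh or
hold/resume gets a 481 back.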

Application Server / Call Server redundancy is great, but there's much
more to consider in fault-tolerant VoIP network designs.

On Aug 13, 2009, at 3:19 PM, Nathan Stratton wrote:
...
On Thu, 13 Aug 2009, Mark Holloway wrote:
...
When you say "it works" - what is the impact to the customer?
Zero
Mark R Lindsey lindsey at e-c-group.com http://e-c-group.com/~lindsey  
+12293160013
