
It works in that (a) BroadWorks can synchronize between servers at different sites; (b) when either site fails, service is restored automatically and quickly after the fault; and (c) when the fault is corrected, the BroadWorks servers re-synchronize gracefully without interruption to subscribers.

But the hard part is not the BroadWorks servers -- it's the access network. In most cases, geographic distribution of the VoIP servers (BroadWorks application servers, in this discussion) means that you have geographically distributed SBCs. Such SBCs are usually not call-state synchronized, which typically means that each SBC has a separate IP address facing subscribers. And most subscribers have only a single link to the access network.

Both the customer's device and the SBC have to handle this properly. For example, when one site fails, you want all of your devices to re-register through the secondary site. How long will it take before they re-register? How will they detect that one site has failed? What happens to calls headed toward the subscriber during the period until the customer re-registers? For the SBC, how will it handle the mass re-registration from subscribers moving over from the other SBC? How will it protect the core registrar (BroadWorks AS in our example) against attack?

Nathan, "Zero" impact on customers is incredibly expensive to achieve. You can, in fact, engineer capacity to make this switching (and even route flapping) graceful, but it means orders of magnitude more expense in your access network. An SBC that can handle 10,000 subscribers today might be able to handle only 100 subscribers if we need to ensure zero new calls are lost, because each of those subscribers has to hammer away at the SBC doing polling. It's probably less costly to bring each subscriber to each of your two sites, then to put more failover at the customer premises (like a smart ALG).
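To put rough numbers on the polling tradeoff, here is a back-of-the-envelope sketch in Python. Only the 10,000 and 100 subscriber figures come from the discussion above. The 100-second baseline polling interval, the 1-second fast interval, and the idea that an SBC has a fixed requests-per-second budget are assumptions chosen for illustration, not measurements of any real SBC.

```python
def poll_rate(subscribers: int, interval_s: float) -> float:
    """Average polling requests per second (e.g. REGISTER refreshes)
    arriving at the SBC when every subscriber polls once per interval."""
    return subscribers / interval_s


def max_subscribers(budget_per_s: float, interval_s: float) -> int:
    """Subscribers the SBC can carry if its request budget stays fixed
    and every subscriber polls once per interval."""
    return int(budget_per_s * interval_s)


# Hypothetical baseline: 10,000 subscribers polling every 100 seconds.
BASELINE_SUBSCRIBERS = 10_000
BASELINE_INTERVAL_S = 100.0

# Treat today's load as the SBC's request budget (an assumption).
budget = poll_rate(BASELINE_SUBSCRIBERS, BASELINE_INTERVAL_S)

# Hypothetical target: detect a dead site within about 1 second,
# so each device must poll roughly once per second.
FAST_INTERVAL_S = 1.0
fast_capacity = max_subscribers(budget, FAST_INTERVAL_S)

print(f"budget: {budget:.0f} requests/s")
print(f"capacity at {FAST_INTERVAL_S:.0f} s polling: {fast_capacity} subscribers")
```

Under these assumptions, a device notices a failed site no sooner than its next poll, so the polling interval is roughly the worst-case detection time. Shrinking that interval by a factor of 100 shrinks the subscriber count the same SBC can carry by the same factor.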
Nevertheless, without call-state synchronization in the SBCs, it may not be possible to achieve full site-to-site failover with Nathan's "Zero" effect on customers. For example: if a call starts on SBC-site-A and then fails over to SBC-site-B, SBC-site-B would normally reject re-INVITEs for that dialog; therefore session audits, call hold/resume, etc. can cause the standing call to drop.

Application Server / Call Server redundancy is great, but there's much more to consider in fault-tolerant VoIP network designs.

On Aug 13, 2009, at 3:19 PM, Nathan Stratton wrote:
On Thu, 13 Aug 2009, Mark Holloway wrote:
When you say "it works" - what is the impact to the customer?
Zero
Mark R Lindsey lindsey at e-c-group.com http://e-c-group.com/~lindsey +12293160013