[VoiceOps] Ideas for Building Inbound Redundancy

Feb. 3, 2017

      On 02.02.2017 at 16:30, Voip Jacob wrote:
...
The case that we're trying to protect against would be if both PBXs at
both data centers were unreachable for our SIP provider (DNS issues,
internal network routing issues, routing issues between SIP provider &
datacenters, etc.). [...]
I can't help exactly with the call queues/groups thingie, but here's 
what I did recently for my inbound infrastructure where DIDs get routed 
to me from several carriers and routed from me to my customers, via SIP. 
I believe I have covered all possible outage scenarios with this setup:

There are 2 data centers, geographically diverse. I colocate servers + 
routers in each. Let's call that a "site". Each site has two redundant 
(with VRRP) routers that speak BGP with the upstream routers.

Per site, there are 2 physical servers with CentOS + KVM, and each 
server hosts 2 VMs:
VM 1) Asterisk for SIP/RTP + Galera Cluster for database
VM 2) quagga for BGP + kamailio as SIP proxy for load balancing/SIP failover

A total of 8 VMs. (To start with)

Network:

I took a spare /24 IPv4 netblock I had lying around, and quagga running 
on the 4 quagga/kamailio VMs is announcing this prefix via BGP to the 4 
internet-facing routers. Each quagga connects to both the active and 
standby router local to this site. That means a total of 8 BGP sessions, 
4 per site, 2 per router.
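
To double-check the session count, here is a minimal Python sketch that 
just enumerates the sessions described above. The site, router and VM 
names are made up for illustration; only the counts come from the setup.

```python
# Hypothetical labels; only the counts (2 sites, 2 routers and
# 2 quagga VMs per site) are taken from the description above.
SITES = ["site1", "site2"]
ROUTERS = ["active", "standby"]
QUAGGA_VMS = ["server1-vm2", "server2-vm2"]


def bgp_sessions():
    """Every quagga VM peers with both routers local to its own site."""
    return [
        (site, vm, router)
        for site in SITES
        for vm in QUAGGA_VMS
        for router in ROUTERS
    ]


if __name__ == "__main__":
    sessions = bgp_sessions()
    print(len(sessions), "BGP sessions in total")
    for site in SITES:
        per_site = [s for s in sessions if s[0] == site]
        print(site, "has", len(per_site), "sessions")
```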

Announcing this prefix at multiple sites at the same time, where each 
site uses different upstream providers, makes that IPv4 prefix 
"anycast'ed": it is visible in the global routing table via multiple 
paths, and the decision about which site the IP packets for this prefix 
end up at is made by the BGP algorithm (and by the provider where the 
traffic originates).

A single IP address out of that /24 is up on all 4 quagga VMs as an 
alias address. Yes, the same IP address, four times. You may think this 
is broken and causes problems when the same IP is up twice in the same 
VLAN, but the BGP algorithm on our internet-facing routers will choose 
one of the VMs and send all traffic to it, while the other VM acts as 
standby. There is no LAN traffic to/from this particular IP, so it just 
works. Initially I messed around with "keepalived" and some other tools, 
but it didn't work out, and running fewer userland daemons (which can 
crash, too) is better. :)

Now, we tell the providers that we buy DIDs from: Hey, route all the SIP 
packets for us to this particular IP only. A single IP is all they need 
and get from us! No more "Please add our new IP address ...".

Database:

I use both Asterisk's dialplan and A2Billing (a "VoIP Softswitch 
Solution", open source and free of charge) to route DIDs to customers. 
Because of A2B and because of CDRs we need a MySQL database. This is 
what Galera Cluster is for, a multi-master active-active replacement for 
MySQL. There are 4 Asterisk/database VMs, so there are 4 instances 
running which synchronize with each other all the time, so it does not 
matter which instance you send your write requests to. I simply use the 
local node for read + write and let Galera take care of the internals. 
There is also a 9th VM in a country far, far away which only runs the 
Galera arbiter. It does not store any MySQL data and simply acts as a 
decision-making component that prevents split-brain situations, which 
the even node count would otherwise allow. It's good that it's far away, 
so it can tell when a whole site is down due to network issues.
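
Why the arbiter matters can be shown with a bit of arithmetic. The 
Python sketch below assumes a plain majority rule with one vote per 
member (4 database nodes + 1 arbiter); Galera's real quorum calculation 
has more details (weights, graceful leaves), so take this as an 
illustration only. The function name is made up.

```python
def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    """Simplified majority rule: strictly more than half of all votes."""
    return 2 * reachable_votes > total_votes


# Assumption: every member (database node or arbiter) has exactly 1 vote.
NODES_PER_SITE = 2
ARBITER = 1

if __name__ == "__main__":
    # Site 1 is gone; site 2 still reaches the far-away arbiter.
    with_arbiter = has_quorum(NODES_PER_SITE + ARBITER, 2 * NODES_PER_SITE + ARBITER)
    # Same outage without an arbiter: 2 of 4 votes is not a majority.
    without_arbiter = has_quorum(NODES_PER_SITE, 2 * NODES_PER_SITE)
    print("with arbiter:", with_arbiter)
    print("without arbiter:", without_arbiter)
```

With the arbiter, the surviving site holds 3 of 5 votes and keeps 
accepting writes; without it, 2 of 4 is a tie and neither half could 
safely continue.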

SIP + media:

kamailio running on the quagga VM is the entry point for all inbound SIP 
traffic. A simple configuration which basically just says: There are 2 
Asterisk servers to distribute the calls to. Check if they're both up. 
If one of them is down, send all traffic to the remaining server. If 
both are up, distribute evenly 50/50 so we get load balancing and all of 
our servers will actually process calls and not just sit idle until 
disaster comes. We let Asterisk handle all RTP and don't worry about an 
RTP proxy.
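
The distribution logic described above boils down to something like 
the following Python sketch. This is not kamailio configuration, just a 
model of the behaviour; the class name and the round-robin choice are my 
assumptions, and the up/down state would come from whatever health check 
the proxy performs.

```python
from typing import Dict, List, Optional


class Dispatcher:
    """Round-robin over the targets that are currently marked as up."""

    def __init__(self, targets: List[str]) -> None:
        self.targets = list(targets)
        self.up: Dict[str, bool] = {t: True for t in self.targets}
        self._counter = 0

    def mark(self, target: str, is_up: bool) -> None:
        """Record the result of a health check for one target."""
        self.up[target] = is_up

    def pick(self) -> Optional[str]:
        """Return the next target for a call, or None if all are down."""
        alive = [t for t in self.targets if self.up[t]]
        if not alive:
            return None
        choice = alive[self._counter % len(alive)]
        self._counter += 1
        return choice


if __name__ == "__main__":
    d = Dispatcher(["asterisk-a", "asterisk-b"])  # hypothetical names
    print([d.pick() for _ in range(4)])  # alternates while both are up
    d.mark("asterisk-a", False)
    print([d.pick() for _ in range(4)])  # everything goes to asterisk-b
```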

Each Asterisk VM has a public IP address that is local to this site and 
unique. So I tell my customers: Please allow inbound calls from these IPs:

5.5.5.5 (site 1, server 1, VM 1)
5.5.5.6 (site 1, server 2, VM 1)
90.90.90.90 (site 2, server 1, VM 1)
90.90.90.91 (site 2, server 2, VM 1)

Now, let's go through the possible disasters and see how this whole 
thing will react:

- Data center lights up in a big ball of fire/upstreams go down/fiber 
cut etc.: site 1 is down, thanks to BGP anycast all traffic will 
instantly and without manual intervention go to site 2, and vice versa.

- Router dies: Remaining router takes over. Each router has BGP sessions 
to both quaggas and to the upstreams, with VRRP between the two routers. 
Instantly + no manual intervention.

- Physical server dies: quagga VM goes down, BGP session disappears, BGP 
session to quagga VM on remaining physical server takes priority on 
router. (quagga + kamailio are active all the time on both VMs; on the 
"standby" VM they just sit idle until the other quagga disappears)

- quagga VM down/reboot: BGP session disappears, remaining VM gets priority.

- Asterisk crashes: kamailio detects this and sends all calls to 
remaining Asterisk.
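
The failover cases in the list above can be played through with a small 
Python model. It is a deliberate simplification: real BGP best-path 
selection is replaced by a fixed preference order (an assumption on my 
part), and all site and VM names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def pick_quagga(sessions: Dict[str, bool], preference: List[str]) -> Optional[str]:
    """Return the first preferred quagga VM whose BGP session is up."""
    for vm in preference:
        if sessions.get(vm, False):
            return vm
    return None


def route_call(
    sites: Dict[str, Dict[str, bool]],
    site_order: List[str],
    vm_preference: Dict[str, List[str]],
) -> Optional[Tuple[str, str]]:
    """Walk the sites in the caller's preference order.

    A site with no working quagga session no longer announces the
    prefix, so traffic falls through to the next site.
    """
    for site in site_order:
        vm = pick_quagga(sites[site], vm_preference[site])
        if vm is not None:
            return site, vm
    return None


if __name__ == "__main__":
    sites = {
        "site1": {"s1-a": True, "s1-b": True},
        "site2": {"s2-a": True, "s2-b": True},
    }
    order = ["site1", "site2"]
    prefs = {"site1": ["s1-a", "s1-b"], "site2": ["s2-a", "s2-b"]}
    print(route_call(sites, order, prefs))  # everything up
    sites["site1"]["s1-a"] = False          # quagga VM or server dies
    print(route_call(sites, order, prefs))
    sites["site1"]["s1-b"] = False          # whole site 1 is gone
    print(route_call(sites, order, prefs))
```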

All MySQL data is always everywhere, always writable, regardless of 
site blowup, server failure or VM crash. We don't worry about hard drive 
faults or filesystem redundancy (GlusterFS, ...). Keep it simple. If a 
server dies, we replace it. (We still use RAID, of course)

I make all configuration changes on a single node. A simple script syncs 
the configuration to the other Asterisk instances and reloads them. Same 
for web access to A2B: a single node is designated for that, but you 
could easily make that redundant as well if web access is crucial, again 
thanks to BGP anycast.
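
For illustration, such a sync script could look roughly like the Python 
sketch below. It is not the actual script: the peer hostnames are 
placeholders, and it assumes key-based SSH access, rsync on all nodes 
and the configuration living in /etc/asterisk/.

```python
import subprocess
from typing import Callable, List

CONFIG_DIR = "/etc/asterisk/"  # assumed config location
# Placeholder hostnames for the other three Asterisk VMs.
PEERS = ["ast2.example.net", "ast3.example.net", "ast4.example.net"]


def build_commands(peer: str, config_dir: str = CONFIG_DIR) -> List[List[str]]:
    """Commands to push the config to one peer and reload Asterisk there."""
    return [
        ["rsync", "-az", "--delete", config_dir, f"{peer}:{config_dir}"],
        # The remote command is one string so the quoting survives ssh.
        ["ssh", peer, "asterisk -rx 'core reload'"],
    ]


def sync_all(peers: List[str], run: Callable = subprocess.run) -> None:
    """Sync and reload every peer; stops at the first failing command."""
    for peer in peers:
        for cmd in build_commands(peer):
            run(cmd, check=True)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    sync_all(PEERS, run=lambda cmd, check: print(" ".join(cmd)))
```

Passing the runner as a parameter makes it easy to do a dry run first 
and only then switch to the default subprocess.run.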

The cool thing is that it scales for both load and redundancy count, 
just add new servers as you please on any site and add them to Galera, 
kamailio, even BGP if you wish. You could even add more sites in other 
cities, countries, continents.

If you have read this far, you have possibly figured out that if 
kamailio crashes on the quagga VM that currently has priority, calls 
will go to a black hole. I was too lazy to set up kamailio failover...yet :)

Also, since there are multiple carriers that deliver DIDs, spread over 
the world and using different upstreams, anycast really does its job and 
some traffic arrives at site 1, other traffic at site 2. By luck, it's 
currently close to a 50/50 distribution.

Since I took a couple of days to implement this, I sleep well again. :)

Regards
Markus

PS: BGP anycast is awesome.

universe＠truemetal.org