
On 02.02.2017 at 16:30, Voip Jacob wrote:
The case that we're trying to protect against would be if both PBXs at both data centers were unreachable for our SIP provider (DNS issues, internal network routing issues, routing issues between SIP provider & datacenters, etc.). [...]
I can't help exactly with the call queues/groups thingie, but here's what I did recently for my inbound infrastructure, where DIDs get routed to me from several carriers and routed from me to my customers via SIP. I believe I have covered all possible outage scenarios with this setup:

There are 2 data centers, geographically diverse. I colocate servers + routers in each. Let's call that a "site". Each site has two redundant routers (with VRRP) that speak BGP with the upstream routers. Per site, there are 2 physical servers with CentOS + KVM, and each server hosts 2 VMs:

VM 1) Asterisk for SIP/RTP + Galera Cluster for database
VM 2) quagga for BGP + kamailio as SIP proxy for load balancing/SIP failover

A total of 8 VMs. (To start with)

Network:

I took a spare /24 IPv4 netblock I had lying around, and quagga running on the 4 quagga/kamailio VMs announces this prefix via BGP to the 4 internet-facing routers. Each quagga connects to both the active and the standby router local to its site. That means a total of 8 BGP sessions, 4 per site, 2 per router.

Announcing this prefix at multiple sites at the same time, where each site uses different upstream providers, results in that IPv4 prefix becoming "anycast'ed": it is visible in the global routing table via multiple paths, and the decision at which site IP packets for this prefix end up is made by the BGP algorithm (and by the provider where the traffic originates).

A single IP address out of that /24 is up on all 4 quagga VMs as an alias address. Yes, the same IP address, four times. You may think this would be broken and cause problems because the same IP is up more than once in the same VLAN, but the BGP algorithm on our internet-facing routers will choose one of the VMs and send all traffic there, while the other VM acts as standby. There is no LAN traffic to/from this particular IP, so it just works.
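To make the session count concrete, here is a minimal Python sketch that enumerates the BGP sessions described above. The host names are made up for illustration (the post only gives the counts), and the only rule modelled is "every quagga VM peers with both routers local to its own site".

```python
from itertools import product

# Hypothetical host names; only the counts (2 sites, 2 routers and
# 2 quagga VMs per site) come from the description above.
SITES = {
    "site1": {
        "routers": ["s1-router-a", "s1-router-b"],
        "quaggas": ["s1-quagga-1", "s1-quagga-2"],
    },
    "site2": {
        "routers": ["s2-router-a", "s2-router-b"],
        "quaggas": ["s2-quagga-1", "s2-quagga-2"],
    },
}


def bgp_sessions(sites):
    """Return (site, quagga, router) for every quagga/router pair within a site."""
    return [
        (name, quagga, router)
        for name, site in sites.items()
        for quagga, router in product(site["quaggas"], site["routers"])
    ]


sessions = bgp_sessions(SITES)
per_site = {name: sum(1 for s in sessions if s[0] == name) for name in SITES}
per_router = {}
for _, _, router in sessions:
    per_router[router] = per_router.get(router, 0) + 1

print(len(sessions))  # total number of sessions
print(per_site)       # sessions per site
print(per_router)     # sessions per router
```

This gives 8 sessions in total, 4 per site and 2 per router, matching the numbers above. Adding a site or another quagga VM to the dictionary shows how the session count grows.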
Initially I messed around with "keepalived" and some other tools, but it didn't work out, and running fewer userland daemons (which can crash, too) is better. :)

Now, we tell the providers that we buy DIDs from: Hey, route all the SIP packets for us to this particular IP only. A single IP is all they need and get from us! No more "Please add our new IP address ...".

Database:

I use both Asterisk's dialplan and A2Billing (a "VoIP Softswitch Solution", open source and free of charge) to route DIDs to customers. Because of A2B and because of CDRs we need a MySQL database. This is what Galera Cluster is for, a multi-master active-active replacement for MySQL. There are 4 Asterisk/database VMs, so there are 4 instances running which synchronize with each other all the time, thus it does not matter which instance you send your write requests to. I simply use the local node for read + write and let Galera take care of the internals.

There is also a 9th VM in a country far, far away which only runs the Galera arbiter. It does not store any MySQL data and simply acts as a decision-making component that prevents split-brain situations, which could otherwise happen because of the even node count. It's good that it's far away, so it is aware when a whole site is down due to network issues.

SIP + media:

kamailio running on the quagga VM is the entry point for all inbound SIP traffic. A simple configuration which basically just says: There are 2 Asterisk servers to distribute the calls to. Check if they're both up. If one of them is down, send all traffic to the remaining server. If both are up, distribute evenly 50/50, so we get load balancing and all of our servers actually process calls and don't just sit idle until disaster comes. We let Asterisk handle all RTP and don't worry about an RTP proxy. Each Asterisk VM has a public IP address that is local to its site and unique.
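The dispatch rule just described (check which Asterisk servers are up, split 50/50 if both are, send everything to the survivor otherwise) can be sketched in a few lines of Python. This is only a model of the decision logic, not kamailio configuration; the class name, the target names and the way health is reported are assumptions made for the example.

```python
from itertools import count


class Dispatcher:
    """Round-robin over the targets that are currently marked as up."""

    def __init__(self, targets):
        self.targets = list(targets)
        self.up = {t: True for t in self.targets}
        self._calls = count()  # running call counter for round-robin

    def mark(self, target, is_up):
        # In a real setup this state would come from periodic health checks.
        self.up[target] = is_up

    def pick(self):
        alive = [t for t in self.targets if self.up[t]]
        if not alive:
            return None  # nothing left to send the call to
        return alive[next(self._calls) % len(alive)]


# Hypothetical names for the two Asterisk VMs at one site.
d = Dispatcher(["asterisk-1", "asterisk-2"])
both_up = [d.pick() for _ in range(4)]

d.mark("asterisk-1", False)  # simulate an Asterisk crash
one_down = [d.pick() for _ in range(3)]

print(both_up)
print(one_down)
```

With both servers up, the calls alternate between the two targets; once one is marked down, every call goes to the remaining one.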
So I tell my customers: Please allow inbound calls from these IPs:

5.5.5.5 (site 1, server 1, VM 1)
5.5.5.6 (site 1, server 2, VM 1)
90.90.90.90 (site 2, server 1, VM 1)
90.90.90.91 (site 2, server 2, VM 1)

Now, let's go through the possible disasters and see how this whole thing will react:

- Data center lights up in a big ball of fire/upstreams go down/fiber cut etc.: site 1 is down; thanks to BGP anycast, all traffic will instantly and without manual intervention go to site 2, and vice versa.
- Router dies: The remaining router takes over. Each router has BGP sessions to both quaggas and to the upstreams, with VRRP between the two routers. Instantly + no manual intervention.
- Physical server dies: quagga VM goes down, BGP session disappears, and the BGP session to the quagga VM on the remaining physical server takes priority on the router. (quagga + kamailio are active all the time on both VMs; on the "standby" VM they just sit idle until the other quagga disappears.)
- quagga VM down/reboot: BGP session disappears, remaining VM gets priority.
- Asterisk crashes: kamailio detects this and sends all calls to the remaining Asterisk.

All MySQL data is always everywhere, always writable, regardless of site blowup, server failure or VM crash.

We don't worry about hard drive faults or filesystem redundancy (GlusterFS, ...). Keep it simple. If a server dies, we replace it. (We still use RAID, of course.)

I make all configuration changes on a single node. A simple script syncs the configuration to the other Asterisk instances and reloads them. Same for web access to A2B: a single node is designated for that, but you could easily make that redundant as well if web access is crucial, again thanks to BGP anycast.

The cool thing is that it scales for both load and redundancy count: just add new servers as you please on any site and add them to Galera, kamailio, even BGP if you wish. You could even add more sites in other cities, countries, continents.
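Why the database survives a whole site going away comes down to simple majority arithmetic, which is also why the far-away arbiter matters with an even node count. The Python sketch below is a simplified model: it assumes every cluster member (including the arbiter) carries exactly one vote and that a partition stays writable only with a strict majority. Galera's real quorum calculation has more detail than this.

```python
def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    """A partition keeps quorum only if it sees a strict majority of all votes."""
    return 2 * reachable_votes > total_votes


# Counts from the setup above; one vote per member is an assumption.
DB_NODES_PER_SITE = 2
SITES = 2
ARBITER = 1

total_with_arbiter = DB_NODES_PER_SITE * SITES + ARBITER  # 5 votes
total_without_arbiter = DB_NODES_PER_SITE * SITES         # 4 votes

# Site 1 is cut off from the world; site 2 still reaches the arbiter.
surviving_side = has_quorum(DB_NODES_PER_SITE + ARBITER, total_with_arbiter)  # 3 of 5
isolated_side = has_quorum(DB_NODES_PER_SITE, total_with_arbiter)             # 2 of 5

# Same outage without an arbiter: each side sees exactly half.
no_arbiter = has_quorum(DB_NODES_PER_SITE, total_without_arbiter)             # 2 of 4

print(surviving_side, isolated_side, no_arbiter)  # True False False
```

With the arbiter, the site that is still reachable holds 3 of 5 votes and keeps accepting writes. Without it, a site split would leave 2 of 4 votes on each side and neither could safely continue.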
If you have read until this point, you have possibly figured out that if kamailio crashes on the quagga VM that currently has priority, calls will go to a black hole. I was too lazy to set up kamailio failover... yet :)

Also, since there are multiple carriers that deliver DIDs, spread over the world and using different upstreams, anycast really does its job: some traffic arrives at site 1, other traffic at site 2. By luck, it's currently close to a 50/50 distribution.

Since I took a couple of days to implement that, I sleep well again. :)

Regards
Markus

PS: BGP anycast is awesome.