Plivo Offline, Domain Expired; Out of Business?

We use Plivo as a monitoring tool for SMS. They've been rock-solid and reliable, and did a good job of letting people know if things were broken.

Their domain expired on April 17th, and now everything in their network is unreachable, including their status.plivo.com site.

Is there anyone from Plivo on the list that cares to comment? There isn't anything on Twitter (@plivo) mentioning that they are aware of the situation and working on it, which concerns me. We started noticing issues mid-last week, but didn't dig into it (clearly should have).

While their disappearance isn't affecting our customers, we have lost visibility of our off-net to on-net SMS testing.

Beckman
---------------------------------------------------------------------------
Peter Beckman                                                  Internet Guy
beckman at angryox.com                             http://www.angryox.com/
---------------------------------------------------------------------------

Looks like they are calling it a "DNS" issue -- not sure they noticed that their whois record expired and the root servers are NOT using their DNS servers to resolve plivo.com domains.

https://docs.google.com/document/u/1/d/1hHIGcDQYb7nlL09-K1IOG_6mnVSDIHw_7D62...

I wonder if it might be faster for them to get an alternative domain up as a stop-gap...

Description

We are experiencing an issue with DNS resolution of plivo.com domains due to a registrar renewal expiration. We are actively working on it.

We recommend all our customers to use google DNS 8.8.8.8 and 8.8.4.4 for the following domains: api.plivo.com, phone.plivo.com, app.plivo.com, manage.plivo.com, sbc.plivo.com, eu.sbc.plivo.com, asia.sbc.plivo.com, au.sbc.plivo.com, sa.sbc.plivo.com

Note: Other DNS resolvers may work as soon as the DNS for plivo.com are propagated but we cannot guarantee the TTL (refresh time) as it is out of our control.

We apologize for the inconvenience, we are doing our best to get the service back.

Recommended resolution: You can hardcode the following DNS entries while we are working on resolving the issue.

Last update : Sun Apr 23 15:07:56 UTC 2017

DNS name              IPs
api.plivo.com         52.8.14.77      54.183.118.182
manage.plivo.com      54.215.176.31   52.8.129.90
app.plivo.com         54.241.31.73
phone.plivo.com       98.158.107.18
sbc.plivo.com         54.235.170.44
eu.sbc.plivo.com      54.247.99.121
asia.sbc.plivo.com    54.251.107.105
au.sbc.plivo.com      54.253.253.59
sa.sbc.plivo.com      54.233.255.206

Events

Sun Apr 23 14:18:48 UTC 2017
While we are working with our DNS provider for plivo.com, we are currently deploying an alternative solution to get the service back online.

Sun Apr 23 13:22:10 UTC 2017
We are working with our provider to expedite the renewal of the DNS. We will update you shortly.

Might want to watch that Google Doc for updates

Beckman

I'd recommend that machines attempting to access Plivo services (servers or desktops) use static entries in their /etc/hosts file (Linux/Unix servers, macOS) or their C:\Windows\System32\Drivers\etc\hosts file (Windows).

https://www.petri.com/easily-edit-hosts-file-windows-10
https://linux.die.net/man/5/hosts

I added this line to my /etc/hosts file on servers needing to access Plivo servers statically (I only use the API):

    54.183.118.182 api.plivo.com

Now working for me.

If you try to use the IP, you'll get a 301 HTTP Redirect to the named host, which will obviously fail. They need to change the HTTP server config to not do that, so right now the fastest way to get access back is through static entries on the servers that need to talk to Plivo.

Beckman
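For anyone who needs more than the API host, here is a minimal sketch that turns the address table from Plivo's status document into hosts-file lines. The names `PLIVO_HOSTS` and `hosts_lines` are made up for illustration, and picking the first listed address for each name is an arbitrary assumption of this sketch, not something Plivo recommended. It only builds the text; appending it to the hosts file (and removing it once DNS is back) is left to you.

```python
# Addresses as published in Plivo's status document on Sun Apr 23 2017.
# They were a temporary workaround and should not be treated as current.
PLIVO_HOSTS = {
    "api.plivo.com": ["52.8.14.77", "54.183.118.182"],
    "manage.plivo.com": ["54.215.176.31", "52.8.129.90"],
    "app.plivo.com": ["54.241.31.73"],
    "phone.plivo.com": ["98.158.107.18"],
    "sbc.plivo.com": ["54.235.170.44"],
    "eu.sbc.plivo.com": ["54.247.99.121"],
    "asia.sbc.plivo.com": ["54.251.107.105"],
    "au.sbc.plivo.com": ["54.253.253.59"],
    "sa.sbc.plivo.com": ["54.233.255.206"],
}


def hosts_lines(entries, marker="# TEMPORARY: Plivo DNS outage workaround"):
    """Return hosts-file lines: a marker comment, then one 'IP<TAB>name' per host."""
    lines = [marker]
    for name, ips in entries.items():
        # Assumption: one static address per name is enough for a stop-gap,
        # so only the first listed address is used.
        lines.append(f"{ips[0]}\t{name}")
    return lines


if __name__ == "__main__":
    print("\n".join(hosts_lines(PLIVO_HOSTS)))
```

The marker comment is there so the entries are easy to find and delete later; stale static entries are their own kind of outage waiting to happen.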

I've done it. Just got distracted, got to fiddling around and thinkin' bout things, and before I know it, the domain's expired and I'm no longer on the register of respectable party guests...

-- Alex

On Apr 23, 2017, at 12:11 PM, Gavin Henry <ghenry at suretec.co.uk> wrote:
How can you let your own core domain expire?

_______________________________________________
VoiceOps mailing list
VoiceOps at voiceops.org
https://puck.nether.net/mailman/listinfo/voiceops

On 23 April 2017 at 17:31, Alex Balashov <abalashov at evaristesys.com> wrote:
I've done it. Just got distracted, got to fiddling around and thinkin' bout things, and before I know it, the domain's expired and I'm no longer on the register of respectable party guests...
Embarrassing. I'm sure I'll do it at some point. :-)

We should all strive to NOT do that. We integrated a once-a-day check into our monitoring platform that starts warning Operations 30 days before the domain expires, and actually pages people starting at 9am on weekdays 7 days before if it hasn't been renewed. We had to tweak it for how our registrar publishes that information, and we automated renewals so it rarely goes off, but when it does we can get in front of it.

We have the same thing in place for our public and internal SSL/TLS certificates.

If you are running a business on the web and don't automate monitoring of critical infrastructure, you get outages like this. Heck, we started monitoring the domain and SSL certs of our critical-path dependent services/vendors after another outage many years ago, when an SSL cert expired.

Plivo wasn't in our mix, as they aren't critical-path, but they are now, and they are still in alarm. Operations will now be automatically notified when we can actually see Plivo again.

Beckman

On Sun, 23 Apr 2017, Gavin Henry wrote:
Embarrassing. I'm sure I'll do it at some point. :-)
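To make the thresholds concrete, here is a rough sketch of the alerting policy described above: warn from 30 days out, page from 7 days out, but only from 9am on weekdays. This is not the actual check from our platform; the function name and constants are invented for illustration, the expiry date is assumed to come from wherever your registrar (or certificate) publishes it, and paging immediately once something has already expired is an extra assumption of this sketch.

```python
from datetime import datetime, timedelta

WARN_DAYS = 30  # start warning Operations this many days before expiry
PAGE_DAYS = 7   # start paging this many days before expiry


def expiry_alert_level(expires_at: datetime, now: datetime) -> str:
    """Return 'ok', 'warn', or 'page' for a domain or certificate expiry date.

    Both datetimes are expected to be naive and in the same local time zone.
    """
    remaining = expires_at - now
    if remaining <= timedelta(0):
        # Assumption of this sketch: an already-expired name pages at any hour.
        return "page"
    if remaining > timedelta(days=WARN_DAYS):
        return "ok"
    if remaining > timedelta(days=PAGE_DAYS):
        return "warn"
    # Inside the paging window: page only from 9am on weekdays
    # (Monday is 0, Friday is 4); otherwise keep warning.
    if now.weekday() < 5 and now.hour >= 9:
        return "page"
    return "warn"


if __name__ == "__main__":
    expiry = datetime(2017, 4, 17)
    for when in (datetime(2017, 3, 1, 10), datetime(2017, 4, 1, 10),
                 datetime(2017, 4, 12, 10), datetime(2017, 4, 23, 12)):
        print(when, expiry_alert_level(expiry, when))
```

Run once a day against each domain and certificate you care about, a function like this is enough to drive the warn/page split; fetching the expiry date is the registrar-specific part.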

Just to clarify, you are saying that you monitor the domain and SSL cert of your vendors so you can notify them? That's cool.

Sincerely,
Keln Taylor
870-204-2121
kelntaylor at gmail.com

Yep, we monitor our vendors. In some cases, better than they monitor themselves. It's frustrating that they don't/can't/won't, but wanting them to change doesn't help our customer experience. So we do it for them proactively.

Too many outages on our vendor services have been caused by their ineptitude, such as this Plivo outage. We decided as a result of this Plivo outage that, while it was merely annoying to our Operations team and didn't result in any customer-facing issues, we could have seen this coming and maybe averted this disaster for everyone by adding two config lines in our monitoring platform and a process to ensure notification to, follow-up with, and closure of the issue with Plivo.

The cost to our Operations team is small, the automated monitoring costs nothing, but the impact of knowing that our vendors are having (or will have) issues before they tell us or know themselves, AND before our customers complain, improves our customer experience and operational excellence despite our vendors' failings.

It also gives us a chance to write defensive code to handle the situations where the vendor is not meeting their contractually obligated level of service.

Beckman
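As one illustration of what "defensive code" can mean here, the sketch below tries a list of vendors in order and falls through to the next when one fails. It is a generic pattern, not code from our systems; `send_with_fallback` and the sender callables are hypothetical stand-ins for whatever vendor clients you actually use.

```python
from typing import Callable, List, Sequence, Tuple

Sender = Callable[[str], None]  # hypothetical vendor client: raises on failure


def send_with_fallback(senders: Sequence[Tuple[str, Sender]], message: str) -> str:
    """Try each (name, sender) in order; return the name of the vendor that worked."""
    errors: List[str] = []
    for name, send in senders:
        try:
            send(message)
            return name
        except Exception as exc:
            # Record the failure and move on to the next vendor.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all vendors failed: " + "; ".join(errors))


if __name__ == "__main__":
    delivered = []

    def primary(msg: str) -> None:
        raise ConnectionError("could not resolve vendor host")

    def backup(msg: str) -> None:
        delivered.append(msg)

    used = send_with_fallback([("primary", primary), ("backup", backup)], "test message")
    print(used, delivered)
```

Whether a fallback is appropriate depends on the service; for something like monitoring probes it is cheap, for billable traffic you probably want a human in the loop.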

The information you gather by monitoring your vendors is priceless. If you are able to determine within a short period of time that your vendor is the problem, you have just saved precious time in identifying the problem. You can then reroute around the issue instead of spending time looking at your own platform.

With regard to the so-called priceless information, I would be careful. Monitoring vendors oneself is good, but it can lead to the misapprehension that the blame for an outage lies with the vendor when in fact it does not, and the outage is instead caused by intermediate routing or whatever.

In my experience, if used with conceit, such monitoring can cause as many problems as it solves. It certainly isn't priceless. That depends on the specific situation.

-- Alex
-- Principal, Evariste Systems LLC (www.evaristesys.com)

On April 23, 2017 4:27:00 PM EDT, Alexander Lopez <alex.lopez at opsys.com> wrote:
The information you gather by monitoring your vendors is priceless, if you are able to determine within a short period of time that your vendor is the problem, you have just saved precious time in identifying the problem. You can then reroute around the issue, instead of spending time in looking at your platform

Still priceless. If our application servers cannot access critical resources from a carrier, regardless of the cause (network, application outage on their end, domain went un-registered), I now know it isn't working.

That still is priceless. Monitoring should tell you what, not why. Blackbox monitoring (measuring what customers actually experience) is wildly more valuable than whitebox monitoring (a DB query took more than 10 seconds, once).

Your metrics system should help you answer why, once you know that something is wrong.

Beckman

Good reading: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcAp...
From a prior Google SRE
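A minimal sketch of the "what, not why" idea: a blackbox probe runs the same action a customer-facing path would (an API call, a test SMS) and records only whether it worked and how long it took. The names `ProbeResult` and `run_probe` and the 10-second default are invented for this example; the action is passed in as a callable, so the probe itself knows nothing about DNS, routing, or the vendor's internals.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeResult:
    name: str
    ok: bool
    seconds: float
    error: Optional[str] = None  # short description of the failure, not a diagnosis


def run_probe(name: str, action: Callable[[], None], max_seconds: float = 10.0) -> ProbeResult:
    """Run one end-to-end action and report whether it succeeded in time."""
    start = time.monotonic()
    try:
        action()
    except Exception as exc:
        # Any failure counts as "down", whatever the underlying cause.
        elapsed = time.monotonic() - start
        return ProbeResult(name, False, elapsed, f"{type(exc).__name__}: {exc}")
    elapsed = time.monotonic() - start
    if elapsed > max_seconds:
        return ProbeResult(name, False, elapsed, "too slow")
    return ProbeResult(name, True, elapsed)


if __name__ == "__main__":
    def failing_call() -> None:
        raise ConnectionError("name resolution failed")

    print(run_probe("vendor-api", failing_call))
    print(run_probe("no-op", lambda: None))
```

The result says that the vendor path is broken from where the probe runs; working out why is a separate job for metrics and logs.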

Yes, it is good to know when things are down. Where I see people getting trouble with self-monitoring is making SLA claims to their vendors, or other financial or contract claims that require ironclad attribution of blame. Any piece of information like this is just a tool. It has epistemic limits. It may be valuable, but it is definitely not priceless. -- Alex
On Apr 23, 2017, at 6:42 PM, Peter Beckman <beckman at angryox.com> wrote:
Still priceless. If our application servers cannot access critical resources from a carrier, regardless of the cause (network, application outage on their end, domain went un-registered), I now know it isn't working.
That still is priceless. Monitoring should tell you what, not why. Blackbox monitoring (measure the experience customers experience) is wildly more valuable than whitebox monitoring (a DB query took more than 10 seconds, once).
Your metrics system should help you answer why, once you know that something is wrong.
Beckman
Good reading: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcAp...
From a prior Google SRE
On Sun, 23 Apr 2017, Alex Balashov wrote:
With regard to the so-called priceless information, I would be careful. Monitoring vendors onesself is good, but it can lead to the misapprehension that the blame for an outage lies with the vendor where it in fact does not, but is instead caused by intermediate routing or whatever.
In my experience, if used with conceit, such monitoring can cause as many problems as it solves. It certainly isn't priceless. That depends on the specific situation.
On April 23, 2017 4:27:00 PM EDT, Alexander Lopez <alex.lopez at opsys.com> wrote:
The information you gather by monitoring your vendors is priceless. If you are able to determine within a short period of time that your vendor is the problem, you have just saved precious time in identifying the problem. You can then reroute around the issue, instead of spending time looking at your own platform.
On Apr 23, 2017 3:10 PM, Peter Beckman <beckman at angryox.com> wrote:
Yep, we monitor our vendors. In some cases, better than they monitor themselves. It's frustrating that they don't/can't/won't, but wanting them to change doesn't help our customer experience. So we do it for them proactively.
Too many outages on our vendor services have been caused by their ineptitude, such as this Plivo outage. We decided as a result of this Plivo outage that, while it was merely annoying to our Operations team and didn't result in any customer-facing issues, we could have seen this coming and maybe averted this disaster for everyone by adding two config lines in our monitoring platform and a process to ensure notification to, followup with and closure of the issue with Plivo.
The cost to our Operations team is small, the automated monitoring costs nothing, but the impact of knowing that our vendors are having (or will have) issues before they tell us or know themselves, AND before our customers complain, improves our customer experience and operational excellence despite our vendors' failings.
It also gives us a chance to write defensive code to handle the situations where the vendor is not meeting their contractually obligated level of service.
Beckman
On Sun, 23 Apr 2017, Keln Taylor wrote:
Just to clarify, you are saying that you monitor the domain and SSL cert of your vendors so you can notify them? That's cool.
Sincerely, Keln Taylor 870-204-2121 kelntaylor at gmail.com
On Sun, Apr 23, 2017 at 12:31 PM, Peter Beckman <beckman at angryox.com> wrote:
We should all strive to NOT do that. We integrated a once-a-day check into our Monitoring platform that starts warning Operations 30 days before the domain expires, and actually pages people starting at 9am on weekdays 7 days before if it hasn't been renewed. We had to tweak it for how our registrar publishes that information, and we automated renewals so it rarely goes off, but when it does we can get in front of it.
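The thresholds described above can be sketched roughly as follows in Python. This is an illustration, not the actual check: the function and constant names are hypothetical, the expiry date is assumed to have been fetched already from wherever the registrar publishes it, and "pages people starting at 9am on weekdays" is read here as "page only on weekdays, from 09:00 local time onward".

```python
from datetime import datetime, timedelta

WARN_DAYS = 30  # start warning Operations this many days before expiry
PAGE_DAYS = 7   # start paging this many days before expiry
PAGE_HOUR = 9   # page no earlier than 9am (assumed local time)


def expiry_alert_level(expires_at: datetime, now: datetime) -> str:
    """Return "ok", "warn", or "page" for a domain that expires at expires_at."""
    remaining = expires_at - now
    if remaining > timedelta(days=WARN_DAYS):
        return "ok"
    is_weekday = now.weekday() < 5        # Monday=0 ... Friday=4
    in_paging_hours = now.hour >= PAGE_HOUR
    if remaining <= timedelta(days=PAGE_DAYS) and is_weekday and in_paging_hours:
        return "page"
    # Inside the warning window, or inside the paging window but outside
    # weekday paging hours: keep warning instead of paging.
    return "warn"
```

Run once a day, a domain that has already expired has a negative `remaining` and so pages on the next weekday check at or after 9am.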
We have the same thing in place for our public and internal SSL/TLS Certificates.
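For certificates, a comparable sketch using only the Python standard library might look like this. The function names are hypothetical, and the result would still need to be fed into whatever warn/page thresholds the monitoring platform uses.

```python
import socket
import ssl


def cert_days_remaining(not_after: str, now: float) -> float:
    """Days until a certificate's notAfter time; negative if already expired.

    not_after uses the format returned by getpeercert(),
    e.g. "Jan  5 09:34:43 2018 GMT"; now is seconds since the epoch.
    """
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400.0


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Fetch the notAfter field of the certificate presented by host:port."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

With the default context, a certificate that has already expired fails verification during the handshake and raises an exception rather than returning a date, so the caller should treat that error as an alarm in its own right.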
If you are running a business on the web and don't automate monitoring of critical infrastructure, you get outages like this. Heck, we have been monitoring the domains and SSL certs of our critical-path dependent services/vendors ever since another outage many years ago, when an SSL cert expired.
Plivo wasn't in our mix, as they aren't critical-path, but they are now, and they are still in alarm. Operations will now be automatically notified when we can actually see Plivo again.
Beckman
On Sun, 23 Apr 2017, Gavin Henry wrote:
On 23 April 2017 at 17:31, Alex Balashov <abalashov at evaristesys.com> wrote:
> I've done it. Just got distracted, got to fiddling around and thinkin' bout things, and before I know it, the domain's expired and I'm no longer on the register of respectable party guests...
Embarrassing. I'm sure I'll do it at some point. :-)
---------------------------------------------------------------------------
Peter Beckman                                                  Internet Guy
beckman at angryox.com                             http://www.angryox.com/
---------------------------------------------------------------------------
_______________________________________________
VoiceOps mailing list VoiceOps at voiceops.org https://puck.nether.net/mailman/listinfo/voiceops
-- Alex
-- Principal, Evariste Systems LLC (www.evaristesys.com)
Sent from my Google Nexus.

When you know that there is a customer-affecting outage and you can (a) get ahead of it or prevent it, (b) manage it, and (c) sometimes work around it, I'd argue that the value to your customers is priceless. The cost of NOT doing so is also priceless, or at least equal to your market value from just before the issue until either (a) everyone forgets about the failure or (b) you reach $0.
I look at this more as a nice-to-have for SLA/contract arguments, but primarily as something in place to deliver a great customer experience by anticipating and measuring failure proactively.
But indeed, we are arguing semantics, and pricelessness is difficult to measure. ;-)
Beckman
On Sun, 23 Apr 2017, Alex Balashov wrote:
Yes, it is good to know when things are down. Where I see people getting into trouble with self-monitoring is in making SLA claims to their vendors, or other financial or contract claims that require ironclad attribution of blame.
Any piece of information like this is just a tool. It has epistemic limits. It may be valuable, but it is definitely not priceless.
-- Alex
participants (5)
- abalashov@evaristesys.com
- alex.lopez@opsys.com
- beckman@angryox.com
- ghenry@suretec.co.uk
- kelntaylor@gmail.com