Kinda wonder at this point what findings exist on their Availability SOC 2, assu...

tick_tock_tick · on July 2, 2019

An SLA of 100% just means your account will be credited for any downtime. It doesn't mean that the company guarantees 100% uptime. No company signs a 100% or 99.99% SLA expecting to actually get that level of uptime; they sign it with the understanding that they will be compensated when there is an issue.

None of the major cloud vendors actually hit 99.99% uptime.

merlincorey · on July 2, 2019

> None of the major cloud vendors actually hit 99.99% uptime.

None of them even promise that -- last time I checked, it was 99.95% for most of them.
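For scale, here's a quick sketch of how much downtime those percentages actually allow. The 30-day month and the function name are my own assumptions, not taken from any vendor's SLA text:

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Minutes of downtime permitted in the period at a given availability percentage."""
    total_minutes = days * 24 * 60  # 43,200 minutes in an assumed 30-day month
    return total_minutes * (1 - sla_percent / 100)

for pct in (99.95, 99.99, 100.0):
    print(f"{pct}% -> {allowed_downtime_minutes(pct):.2f} min per 30 days")
```

That works out to roughly 21.6 minutes a month at 99.95% and about 4.3 minutes at 99.99%.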

rwiggins · on July 3, 2019

AWS services have their own individual SLAs. Route53, in particular, has a 100% SLA: https://aws.amazon.com/route53/sla/

(To my knowledge, it's the only AWS service to promise 100%.)

syntheticcdo · on July 3, 2019

Interesting distinction here: 100% SLA on responding to incoming DNS requests. The R53 console or management interfaces could be down and the SLA stays intact -- if you can't update your DNS, then serving incorrect responses 100% of the time isn't very helpful.

rwiggins · on July 3, 2019

Very true. I wonder if the control plane is hosted in a single region.

bryant · on July 2, 2019

By its very nature, an SLA of 100% is a guarantee that the service will be available 100% of the time or else the relevant penalties, explicitly stated or otherwise applicable, can be applied.

The question is whether the guarantee is meaningful by way of whether the penalties will significantly dissuade failures to meet the guarantee, and I'd argue in the case of Cloudflare, this isn't the case.

[Edit: Cloudflare's standard] penalty is a service credit defined as follows:

> 6.1 For any and each Outage Period during a monthly billing period the Company will provide as a Service Credit an amount calculated as follows: Service Credit = (Outage Period minutes * Affected Customer Ratio) ÷ Scheduled Availability minutes

https://www.cloudflare.com/business-sla/

And that's woefully inadequate for any enterprise client with mission- or life-critical services.
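To put a rough number on it, here's a sketch of the quoted formula. The function name, the example outage, and the fee are made up for illustration; the quoted clause only gives the ratio, so converting it to dollars by multiplying by the monthly fee is my assumption:

```python
def service_credit_ratio(outage_minutes: float,
                         affected_customer_ratio: float,
                         scheduled_minutes: float) -> float:
    """Ratio from the quoted section 6.1 formula."""
    return (outage_minutes * affected_customer_ratio) / scheduled_minutes

# Hypothetical case: 30-minute outage, Affected Customer Ratio of 1.0, 30-day month.
ratio = service_credit_ratio(30, 1.0, 30 * 24 * 60)
monthly_fee = 200.0  # hypothetical fee; applying the ratio to it is an assumption
print(f"credit ratio: {ratio:.5f}, credit: ${ratio * monthly_fee:.2f}")
```

Under those assumptions, a half-hour total outage is worth about 0.07% of the monthly bill.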

---

TL;DR: An SLA is a guarantee, by the very definition of the word "guarantee," that a service will be delivered to a specific level and that certain agreed-upon penalties will be applied to the service provider if this guarantee is not met.

Edited for tone.

opportune · on July 2, 2019

As you note, unless you negotiate a custom contract, usually the "penalties" are very, very mild. It's effectively the same as there not being a penalty at all. The SLA is just a marketing nice-to-have divorced from the engineering realities.

bryant · on July 2, 2019

Yup, and given Cloudflare's recent performance, I'd venture that more heavy-handed contracts need to be negotiated with them to drive an improvement in performance, or at the very least a paradigm shift in how they sustain availability to the clients who really care for it.

klodolph · on July 2, 2019

As an engineer, I get pissed whenever I see 100% uptime, or eleven-nines, nine-nines, or other impossible targets. Like, how am I supposed to design a system with numbers like that?

toast0 · on July 3, 2019

The thing is, a real SLA will have things like time to detect errors and time to mitigate, time to repair, etc.

100% uptime doesn't necessarily mean nothing failed; it means the failure detection and mitigation worked within the allowed windows. In a typical internet environment, that means allowing connections to die when the server they're connected to dies. It would be possible to hand off TCP connections, but nobody does it.
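A minimal sketch of that idea, assuming an SLA that only counts a failure as downtime when detection or mitigation overruns its window. The `Incident` type, the window lengths, and the sample incidents are all hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Incident:
    detect_seconds: float    # time until the failure was detected
    mitigate_seconds: float  # time from detection until traffic was healthy again

def counts_as_downtime(incident: Incident,
                       max_detect: float = 10.0,
                       max_mitigate: float = 30.0) -> bool:
    """True only if detection or mitigation overran its allowed window."""
    return (incident.detect_seconds > max_detect
            or incident.mitigate_seconds > max_mitigate)

# Two hypothetical failures: one handled inside the windows, one not.
incidents = [Incident(3, 12), Incident(8, 45)]
breaches = [i for i in incidents if counts_as_downtime(i)]
print(f"{len(incidents)} failures, {len(breaches)} SLA breach(es)")
```

Both incidents are real failures, but only the second one would count against the SLA under these made-up windows.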

If you want to get close to those numbers, you need to have a real reason, and then you need to make sure you have a good plan for everything that can go wrong. Power, routers, fiber, load balancers, switches, hosts, etc. And then do your best not to push bad software / bad configuration.

Bare metal on quality hardware with redundant networking goes a long way towards reliability, once the kinks are worked out.

cortesoft · on July 2, 2019

SLAs aren't for engineers, they are for financial people to make agreements on payments for downtime.

klodolph · on July 2, 2019

Good SLAs are also for engineers.

dserodio · on July 3, 2019

That's what SLOs are for.

klodolph · on July 3, 2019

If you only look at the SLOs you are a junior engineer working for someone else making the big decisions. If you are designing a system you want to look at the SLA. Engineers are not just assembly line workers that consume specs and spit out parts.

Nothing wrong with just using SLOs, but if you are a technical lead or senior engineer, you should have the big picture.

Lorin · on July 2, 2019

Deploy once, never update, and deploy a missile defense to prevent backhoes from digging up fiber?

JaimeThompson · on July 2, 2019

You honestly think a missile defense system will work? Backhoes are much more creative than that. You will need defense in depth, roaming patrols, as well as air and satellite based monitoring assets.

jacques_chester · on July 2, 2019

And then the fibre will get cut by a building crew working on a guard tower.

ohyeshedid · on July 3, 2019

https://i.imgur.com/rDW7W3d.png

klodolph · on July 2, 2019

Ah yes, missile defenses, like the MIM-104 Patriot: https://en.wikipedia.org/wiki/MIM-104_Patriot#Failure_at_Dha...

haggy · on July 2, 2019

You don't. Like others have said, SLAs are a sales tool, not an objective for the engineering team to achieve.

jaclaz · on July 3, 2019

Excuse my curiosity, but what are numbers that you would find acceptable as targets?

Or - if you prefer - what is the "reasonable" percentage of issues - timewise - for an internet service?

bifrost · on July 2, 2019

I agree, advertised SLAs are garbage.

Agreements to uphold past performance are much better.

cortesoft · on July 2, 2019

SLAs are all about getting compensation when they are broken. It isn't about actual uptime.

bifrost · on July 3, 2019

There was a point in time when that wasn't true, but as people started accepting a lower quality of service it became easier to just pay than do the right thing.

ejcx · on July 2, 2019

You can read our SOC3 (public facing SOC2) if you're curious about your availability question: https://www.cloudflare.com/compliance/

There's a lot of good info in there.

bryant · on July 2, 2019

Ah, hi Evan.

There's a lot of good info here, but there are many more questions raised in my mind based on what I'm reading in the SOC3 than perhaps what you might've expected. I can ideally run through them if I catch you again at DEF CON this year. I'm also willing to sign your standard MNDA to review your SOC 2, but we can take that thread offline.