
Kinda wonder at this point what findings exist on their Availability SOC 2, assuming they've gotten one.

The repeated outages plus the constant malicious advertising by scammy ad providers through Cloudflare are slowly turning me off to the service as a potential enterprise customer. Unfortunate too, since plenty of superlatively qualified people build great things there (hat tip to Nick Sullivan), but it seems like the build-fast culture may now be impeding the availability requirements of their clients.

This is also a great example of a case where SLAs are meaningless without rigorous enforcement provisions negotiated in by enterprise clients. Cloudflare advertises 100% uptime (https://www.cloudflare.com/business-sla/) but every time they fall over, they're down for what, an hour at a time? Just this one issue would've blown anyone else's 99.99% SLA out of the water -- https://www.cloudflarestatus.com/incidents/tx4pgxs6zxdr
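For scale, here is a back-of-the-envelope sketch of how much downtime a given uptime percentage allows. The 30-day month and the 60-minute outage are assumptions for illustration (the latter taken from the rough "an hour at a time" estimate above), not figures from the incident report.

```python
def downtime_budget_minutes(uptime_percent: float, period_days: float = 30.0) -> float:
    """Minutes of downtime allowed in a period at the given uptime percentage."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)

# Assumed one-hour outage, per the rough estimate above.
outage_minutes = 60

monthly = downtime_budget_minutes(99.99)       # budget for a 30-day month
yearly = downtime_budget_minutes(99.99, 365)   # budget for a 365-day year

print(f"99.99% monthly budget: {monthly:.2f} min")
print(f"99.99% yearly budget:  {yearly:.2f} min")
print(f"Outage exceeds the yearly budget: {outage_minutes > yearly}")
```

Under those assumptions, a single hour-long outage uses up more than a full year of a 99.99% budget.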

I love the service, but if I'm to consider consuming the service, they'd do well to have the equivalent of a long term servicing branch as its own isolated environment, one where changes are only merged in once they've proven to be hyper-stable.



An SLA of 100% just means your account will be credited for any downtime. It doesn't mean that the company guarantees 100% uptime. No company signs a 100% or 99.99% SLA expecting to actually get 99.99% uptime; they sign with the understanding that they will be compensated when there is an issue.

None of the major cloud vendors actually hit 99.99% uptime.


> None of the major cloud vendors actually hit 99.99% uptime.

None of them even promise that -- last time I checked, it was 99.95% for most of them.


AWS services have their own individual SLAs. Route53, in particular, has a 100% SLA: https://aws.amazon.com/route53/sla/

(To my knowledge, it's the only AWS service to promise 100%.)


Interesting distinction here: the 100% SLA is on responding to incoming DNS requests. The R53 console or management interfaces could be down and the SLA stays intact. If you can't update your DNS, then 100% availability of incorrect responses isn't very helpful.


Very true. I wonder if the control plane is hosted in a single region.


By its very nature, an SLA of 100% is a guarantee that the service will be available 100% of the time or else the relevant penalties, explicitly stated or otherwise applicable, can be applied.

The question is whether the guarantee is meaningful, that is, whether the penalties are significant enough to deter failures to meet it, and I'd argue that in Cloudflare's case they aren't.

[Edit: Cloudflare's standard] penalty is a service credit defined as follows:

> 6.1 For any and each Outage Period during a monthly billing period the Company will provide as a Service Credit an amount calculated as follows: Service Credit = (Outage Period minutes * Affected Customer Ratio) ÷ Scheduled Availability minutes

https://www.cloudflare.com/business-sla/

And that's woefully inadequate for any enterprise client with mission- or life-critical services.
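To put rough numbers on that, here is a small sketch of the quoted formula. The inputs (a 60-minute outage affecting every customer, a 30-day month, a $200 monthly fee) are made up for illustration, and applying the resulting ratio to the monthly fee is an assumption; the quoted clause only defines the ratio itself.

```python
def service_credit_ratio(outage_minutes: float,
                         affected_customer_ratio: float,
                         scheduled_minutes: float) -> float:
    """Service Credit = (Outage Period minutes * Affected Customer Ratio)
    / Scheduled Availability minutes, as quoted from section 6.1."""
    return (outage_minutes * affected_customer_ratio) / scheduled_minutes

# Hypothetical inputs, for illustration only.
scheduled = 30 * 24 * 60  # minutes in a 30-day month
ratio = service_credit_ratio(60, 1.0, scheduled)

# Assumption: the ratio is applied to the monthly fee.
monthly_fee = 200.00
credit = ratio * monthly_fee

print(f"credit ratio: {ratio:.4%}")
print(f"credit on a ${monthly_fee:.2f} plan: ${credit:.2f}")
```

Under those assumptions the credit for an hour of total downtime comes to well under a dollar.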

---

TL;DR: An SLA is a guarantee, by the very definition of the word "guarantee," that a service will be delivered to a specific level and that certain agreed-upon penalties will be applied to the service provider if this guarantee is not met.

Edited for tone.


As you note, unless you negotiate a custom contract, the "penalties" are usually very, very mild. It's effectively the same as there not being a penalty at all. The SLA is just a marketing nice-to-have divorced from the engineering realities.


Yup, and given Cloudflare's recent performance, I'd venture that more heavy-handed contracts need to be negotiated with them to drive an improvement in performance, or at the very least a paradigm shift in how they sustain availability to the clients who really care for it.


As an engineer, I get pissed whenever I see 100% uptime, or eleven-nines, nine-nines, or other impossible targets. Like, how am I supposed to design a system with numbers like that?
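For a sense of scale, a quick sketch of what those targets allow per year. It assumes a 365-day year and reads "N nines" as an availability of 1 - 10^-N; both are simplifications.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year

def allowed_downtime_seconds(nines: int) -> float:
    """Yearly downtime allowed at 'N nines', read as availability 1 - 10**-N."""
    return SECONDS_PER_YEAR * 10.0 ** -nines

for n in (2, 3, 4, 5, 9, 11):
    print(f"{n:>2} nines: {allowed_downtime_seconds(n):.6f} s/year")
```

On that reading, nine nines leaves roughly 32 milliseconds of downtime per year, and eleven nines leaves well under a millisecond.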


The thing is, a real SLA will have things like time to detect errors and time to mitigate, time to repair, etc.

100% uptime doesn't necessarily mean nothing failed; it means the failure detection and mitigation worked within the allowed windows. In a typical internet environment, that means allowing connections to die when the server they're connected to dies. It would be possible to hand off TCP connections, but nobody does it.

If you want to get close to those numbers, you need to have a real reason, and then you need to make sure you have a good plan for everything that can go wrong. Power, routers, fiber, load balancers, switches, hosts, etc. And then do your best not to push bad software / bad configuration.

Bare metal on quality hardware with redundant networking goes a long way towards reliability, once the kinks are worked out.
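A rough sketch of why the redundancy matters, using the standard series/parallel availability arithmetic. The component list echoes the one above, but the 99.9% per-component figure is invented for illustration, and the model assumes failures are independent, which real outages often aren't.

```python
from math import prod

def series(availabilities):
    """All components must be up: availabilities multiply."""
    return prod(availabilities)

def redundant(availability: float, copies: int) -> float:
    """At least one of N independent copies must be up."""
    return 1 - (1 - availability) ** copies

# Hypothetical: six components in the path, each 99.9% available.
components = ["power", "router", "fiber", "load balancer", "switch", "host"]
single = 0.999

no_redundancy = series([single] * len(components))
with_pairs = series([redundant(single, 2)] * len(components))

print(f"single of each:  {no_redundancy:.6f}")
print(f"redundant pairs: {with_pairs:.6f}")
```

The point of the sketch is only that a chain of individually decent components multiplies down quickly, while duplicating each link pulls the total back up; it says nothing about bad software or configuration pushes, which hit every copy at once.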


SLAs aren't for engineers, they are for financial people to make agreements on payments for downtime.


Good SLAs are also for engineers.


That's what SLOs are for


If you only look at the SLOs you are a junior engineer working for someone else making the big decisions. If you are designing a system you want to look at the SLA. Engineers are not just assembly line workers that consume specs and spit out parts.

Nothing wrong with just using SLOs, but if you are a technical lead or senior engineer, you should have the big picture.


Deploy once, never update, and deploy a missile defense to prevent backhoes from digging up fiber?


You honestly think a missile defense system will work? Backhoes are much more creative than that. You will need defense in depth, roaming patrols, as well as air and satellite based monitoring assets.


And then the fibre will get cut by a building crew working on a guard tower.



Ah yes, missile defenses, like the MIM-104 Patriot: https://en.wikipedia.org/wiki/MIM-104_Patriot#Failure_at_Dha...


You don't. Like others have said, SLAs are a sales tool, not an objective for the engineering team to achieve.


Excuse my curiosity, but what are numbers that you would find acceptable as targets?

Or - if you prefer - what is the "reasonable" percentage of issues - timewise - for an internet service?


I agree, advertised SLAs are garbage.

Agreements to uphold past performance are much better.


SLAs are all about getting compensation when they are broken. It isn't about actual uptime.


There was a point in time when that wasn't true, but as people started accepting a lower quality of service, it became easier to just pay than to do the right thing.


You can read our SOC3 (public facing SOC2) if you're curious about your availability question: https://www.cloudflare.com/compliance/

There's a lot of good info in there


Ah, hi Evan.

There's a lot of good info here, but what I'm reading in the SOC3 raises many more questions in my mind than you might've expected. Ideally, I can run through them if I catch you again at DEF CON this year. I'm also willing to sign your standard MNDA to review your SOC 2, but we can take that thread offline.



