
If a single regex can take down the Internet for half an hour, that's definitely not good -- especially for a class of errors that can easily be prevented, tested for, etc.

The timing is unfortunate too, after calling out Verizon for lack of due process and negligence.

I'm sure they have an undo or rollback mechanism for deployments, but it's probably worth investing in further.

They also need to resolve the catch-22 where people could not log in and disable the CloudFlare proxy ("orange cloud") because cloudflare.com itself was down.



> The timing is unfortunate too, after calling out Verizon for lack of due process and negligence.

Nonetheless, Verizon could take a leaf out of their responsiveness and transparency book.


Yeah. They criticized Verizon for being unresponsive. Mistakes happen.


You'd think after leaking private data for literally months less than 3 years ago (and only noticing because Google had to point it out to them) that they'd, y'know, have at least some kind of QA environment fed with sample traffic by now. Really hard to believe they're still getting caught testing in prod.


As someone working in that field, the arrogance of CloudFlare is still unbelievable to me.

After their huge Cloudbleed issue, and now this one on top of it, they continue to call out everyone else through their blog posts. And everyone seems fine with it because they are a hyped company.


I don't use CloudFlare nor have any interest in them, but I don't see the arrogance. The issues CloudFlare has are things everyone takes seriously and works very hard on. Deployment and memory safety are hard problems that trip up the best of the best; it happens to Google, Amazon, and Facebook. If anything, the idea that this would be more damaging because it is more public is arrogant. If CloudFlare were saying that everything is fine you might have a point, but they aren't. Just like the other companies mentioned, they seem to be improving their processes, code, and infrastructure to try to mitigate these problems.

What they are criticising, however, are things like not adopting new protocols or not taking issues that affect everyone seriously. That isn't something that would happen if people were trying. And the response from some of the industry is "we know what we are doing", and shortly after, the same thing happens again and again and again.

So I don't really see CloudFlare as being that arrogant; if anything, it's the "you are not better than us" attitude from some parts of the industry that is. The day I see CloudFlare not trying, I would be happy to call them arrogant. But if anything, I would argue they are successful precisely because they try more than most.


> The issues CloudFlare has are things everyone takes seriously and works very hard on. Deployment and memory safety are hard problems that trip up the best of the best.

Cloudflare has improved a lot. You can see just from what they're open sourcing that their usage of Go and Rust has increased significantly. And I'm sure we'll see improvements in deployment practices too.

When Cloudbleed happened I was very vocal and skeptical, but this is different. Everyone makes mistakes.


> Cloudflare has improved a lot. You can see just from what they're open sourcing that their usage of Go and Rust has increased significantly.

You say this like using trendy languages implicitly indicates improvement.


As a random outsider who really couldn't care less about the service CloudFlare provides: their responses to outages and their transparency are really great, and I wish more tech companies would do the same. It gets tiring hearing about large outages at other services/providers and only learning that they were caused by "network partitions" or other networking issues. Every company has to deal with these issues, and CloudFlare does an awesome job of letting me at least learn something about what went wrong when these incidents happen.


We’ve actually had our data leaked by one of their engineers working in his free time. He found an open database and leaked it to the press. He was probably just scanning random IP ranges and stumbled upon it; I don’t think he was targeting CF clients in particular. Hopefully they will stay humble and fix their own issues first. On a side note, an anecdote came out of that leak... We were contacted by a big-name tech website asking if the data was ours, before they published the article. Unfortunately, the author sent us the email from his @gmail address, which did not add to his credibility, so it was brushed off for a day or two until we saw the article published. Can’t say whether it was a dark pattern of his to not use his work email to notify us or not...


If he wasn't doing it as part of his job, using a work email address to contact someone about a security issue sounds like it would have been a bad idea.


It should be taken as a given that testing is necessary but not sufficient to prevent production outages, or limit their impact.

Monitoring, canaries, and experimentation need to be adopted pretty much everywhere possible.


> It should be taken as a given that testing is necessary but not sufficient to prevent production outages […]

That depends on how good your tests are.


And how good your employees are... How good your review process is... How good xyz is...

If your engineers are that solid and each one individually has only a 0.5% chance of making a mistake on a given release, then with 50 engineers the probability of nothing going wrong is about 78% (0.995^50), and the probability of something going wrong is 1 - 0.995^50. Pretty low odds of a clean release, I might say.

Don't do this to your engineers. 80% test coverage is a sweet spot; the rest is caught better with other approaches. There's no reason to kill engineers' productivity every time something fails in production by blaming their tests for not being good enough.


The probability of something going wrong should be 1 - P(nothing going wrong).

In this example, that's about 22%.
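
For anyone who wants to double-check the arithmetic, here is a quick Python sketch using the hypothetical numbers from the comments above (50 engineers, 0.5% per-release mistake rate):

    # Hypothetical numbers from the comments above: 50 engineers, each with a
    # 0.5% chance of shipping a bad change in a given release.
    p_ok_each = 0.995
    n_engineers = 50

    p_nothing_wrong = p_ok_each ** n_engineers     # ~0.778
    p_something_wrong = 1 - p_nothing_wrong        # ~0.222

    print(f"P(nothing wrong)   = {p_nothing_wrong:.3f}")
    print(f"P(something wrong) = {p_something_wrong:.3f}")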


Since the work involved in a regular expression match can depend heavily on the input for non-trivial expressions, one fun case (probably not the one here, though) is that a user of your system could start sending a pathological input that no amount of standard testing (synthetic or replayed traffic, staging environments, production canaries) would have caught.

Didn't take anything down, but did cause an inordinate amount of effort tracking down what was suddenly blocking the event loop without any operational changes to the system...
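
To make the pathological case concrete, here's a minimal Python sketch of catastrophic backtracking; the pattern and inputs are illustrative, not the ones from any real incident:

    import re
    import time

    # Classic pathological pattern: the nested quantifiers give a backtracking
    # engine exponentially many ways to split a run of 'a's.
    pattern = re.compile(r'(a+)+$')

    for n in (18, 20, 22, 24):
        s = 'a' * n + '!'   # the trailing '!' forces the overall match to fail
        start = time.perf_counter()
        pattern.match(s)
        print(n, round(time.perf_counter() - start, 3), 'seconds')

    # Every +2 characters roughly quadruples the runtime, so an input only
    # slightly longer than anything in your test corpus can block a thread
    # (or an event loop) for seconds or minutes.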


See https://swtch.com/~rsc/regexp/ to understand why that isn't necessarily true.


Cloudflare uses re2, which doesn't suffer from this problem, but apparently they didn't use it here?

https://github.com/cloudflare/lua-re2

https://github.com/google/re2

https://github.com/google/re2/wiki/WhyRE2
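
For contrast, here's a rough sketch of the same nested-quantifier pattern going through RE2 instead of a backtracking engine. This assumes the google-re2 Python bindings (imported as re2); the exact binding API may differ slightly between versions:

    import time
    import re2  # google-re2 bindings; the API mirrors the stdlib 're' module

    pattern = re2.compile(r'(a+)+$')   # same nested-quantifier pattern as above

    s = 'a' * 100000 + '!'
    start = time.perf_counter()
    pattern.match(s)
    print(round(time.perf_counter() - start, 4), 'seconds')

    # RE2 compiles the pattern to a finite automaton, so matching time is
    # linear in the input length; no input can trigger exponential backtracking.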


Sounds like a job for property-based testing!
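
A rough sketch of what that might look like with Hypothesis (a Python property-based testing library); the pattern, input strategy, and time budget are all made up for illustration:

    import re
    import time
    from hypothesis import given, settings, strategies as st

    PATTERN = re.compile(r'(a+)+$')   # stand-in for a WAF-style rule
    TIME_BUDGET = 0.05                # illustrative per-input budget, in seconds

    # Generating structured inputs (a run of 'a's plus an arbitrary tail) keeps
    # the pathological shape reachable; purely random text would rarely hit it.
    @settings(deadline=None, max_examples=200)
    @given(st.integers(min_value=0, max_value=24), st.text(max_size=5))
    def test_rule_matches_within_budget(run_length, tail):
        s = 'a' * run_length + tail
        start = time.perf_counter()
        PATTERN.match(s)
        assert time.perf_counter() - start < TIME_BUDGET

    if __name__ == '__main__':
        test_rule_matches_within_budget()  # Hypothesis shrinks any failure to a minimal case

Of course, this only works if the input strategy can actually reach the pathological shape, which is the same limitation the fuzzing replies below point out.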


Fuzz testing could help


Yep, it could help in some cases.

It's nowhere near as standard a part of release verification as the other approaches, though.

And in complex cases (say, a large multi-tenant service with complex configuration), it can be very hard to find the combination of inputs necessary to catch this issue. If you have hundreds of customer configurations, and only one of them has this particular feature enabled (or uses this sort of expression), fuzzing is less likely to be effective.


> If a single regex can take down the Internet for a half hour, that's definitely not good

As I commented yesterday, this is due to the fact that "the Internet" thinks it needs to use Cloudflare services, although there really is no need to do so.

Stupid people making stupid decisions and then wondering why their services are down.



