The AWS version of ES has been abysmal; its only saving grace is that it’s “in the ecosystem”. I was convinced by an AWS zealot on my team. Never again.
Personally, my team and I just evaluated the AWS ES and Elastic offerings, and the AWS offering (surprisingly!) came out on top for our use case: better performance, better IaC support, and marginally cheaper.
Honestly I would have preferred the Elastic offering to work better, but that wasn't the case.
I send about 100GB a day to Amazon ES and it works fine. I used to maintain ES 2.x and 5.x on my own, and it was more work for me personally for only a slight cost savings.
What has been abysmal for you? Maybe your use case is more advanced than ours, which is mainly absorbing logs from all over the place and building the typical dashboards and alerts on them (with Grafana).
I've seen devs toss crud into infra with debug logs enabled, with millions of lines of deprecated log messages, etc., and the infra budget eats their costs.
It's insane. Unless you're literally Facebook, or ingesting data from CERN's LHC... what possible use case requires 100GB of text data ingest per day?
Maybe it's a case of someone throwing a Service Mesh into a Microservices K8s cluster and logging all the things?
4-5MB/min per VM of compressed application-server logs for input/output traffic, across roughly 100 VMs per site. That's about half a GB/min, or 700GB+/day in plain-text logs from a single site's app servers alone (rough math below).
Normally that's no issue, as the data is stored in SANs and not sent to the cloud for analysis; just giving some perspective.
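For anyone who wants to sanity-check the figures, here's a quick back-of-the-envelope sketch. The ~5MB/min and ~100 VM numbers come straight from the comment above; everything else is just unit conversion:

    # Rough check of the per-site log volume quoted above.
    mb_per_min_per_vm = 5            # upper end of the 4-5 MB/min range
    vms_per_site = 100               # "around 100ish VMs/site"
    minutes_per_day = 24 * 60

    site_mb_per_min = mb_per_min_per_vm * vms_per_site          # ~500 MB/min, i.e. ~0.5 GB/min
    site_gb_per_day = site_mb_per_min * minutes_per_day / 1024  # ~700 GB/day

    print(f"{site_mb_per_min} MB/min -> {site_gb_per_day:.0f} GB/day per site")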
It's absolutely caused by exactly the things you've mentioned. I think we could easily cut it by 75% if we simply had people setting severity levels correctly and disabled storing debug logs outside of experimental environments.
90%+ of our logs are severity INFO or have no severity at all. It's like pulling teeth to even get devs to output logs in the corporate-standard json-per-line format with mandatory fields (something like the sketch below).
Still, once you're running hundreds of VMs processing a big data pipeline, it's not hard to end up with massive amounts of logs. It's not just logging, really; it's also metrics and trace information.
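For context, the json-per-line format I'm talking about is nothing exotic. Here's a minimal sketch using only the Python standard library; the field names (timestamp, severity, service, message) and the service name are illustrative, not our actual corporate schema:

    import json
    import logging
    import sys
    from datetime import datetime, timezone

    class JsonLineFormatter(logging.Formatter):
        """Render each log record as one JSON object per line (illustrative fields only)."""
        def format(self, record):
            return json.dumps({
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "severity": record.levelname,      # the field people keep leaving out
                "service": "checkout-api",         # placeholder service name
                "message": record.getMessage(),
            })

    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonLineFormatter())
    log = logging.getLogger("example")
    log.addHandler(handler)
    log.setLevel(logging.INFO)

    log.info("order placed")  # -> {"timestamp": "...", "severity": "INFO", "service": "checkout-api", "message": "order placed"}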
It is, relative to the scale AWS ES is apparently built to support.
> Amazon Elasticsearch Service lets you store up to 3 PB of data in a single cluster, enabling you to run large log analytics workloads via a single Kibana interface.
Yes, that's a small amount of data. I've worked at small companies with an order of magnitude more data per day and larger companies with three orders of magnitude more data per day flowing into a text indexing service.
We used to use ES, but after multiple issues with the index getting corrupted for various reasons, we decided to actually save money and use Algolia (before their pricing structure change; it would be more expensive now). Over the last year+ we've had no issues with search or indexing.
I guess it depends on who’s managing your ELK. I’ve been running ELK stacks in some form or another for just about 8 years; AWS’s was very poor in availability compared to what it should have been.
There were a bunch of issues, honestly: large messages got truncated, there were huge stalls in throughput (300-500ms freezes), and the fact that it was missing a bunch of features and was slower to query was just icing on the cake.
Outages during scale-out, weird insurmountable restrictions on cross-region replication, cost, backup and restore that don’t work properly, irrecoverable cluster corruption, and it’s 100% opaque whenever something goes wrong, which means escalating to AWS support, who barely understand it.
The big sell was easy deployment and cost management but the negatives outweighed that.
We tried Elastic themselves, but the sales and technical process was a nightmare. It was virtually impossible to work out what we were going to have to spend. We needed X-Pack (RBAC and SSO), but the enterprise licensing for it is all over the place.
Ergo we didn’t bother and stuck with our DBMS full-text search stuff. It’s crap but the ROI is better.
Part of the issue is the lack of visibility into what is going on; you don't get much ability to debug.
The company I was at switched from AWS's Elasticsearch to hosting it ourselves on EC2 instances so that we could control the memory/CPU/disks, see the logging, and understand more about the internals, to help figure out why it was slow for our use case.