The AWS version of ES has been abysmal; its only saving grace is that it’s “in the ecosystem”. I was convinced by an AWS zealot on my team. Never again.
Personally, my team and I just evaluated the AWS ES and Elastic offerings, and the AWS offering (surprisingly!) came out on top for our use case: better performance, better IaC support, and marginally cheaper.
Honestly I would have preferred the Elastic offering to work better, but that wasn't the case.
I send about 100GB a day to Amazon ES and it works fine. I used to maintain ES 2.x and 5.x on my own, and it was more work for me personally for only a slight cost savings.
What has been abysmal for you? Maybe your use case is more advanced than ours, which is mainly absorbing logs from all over the place and building the typical dashboards and alerts on them (with Grafana).
I've seen devs toss crud into infra with debug logs enabled, with millions of lines of deprecated log messages, etc., and the infra budget eats their costs.
It's insane. Unless you're literally Facebook, or ingesting data from CERN's LHC... what possible use case requires 100GB of text data ingest per day?
Maybe it's a case of someone throwing a Service Mesh into a Microservices K8s cluster and logging all the things?
4-5MB/min per VM of compressed application-server logs for input/output traffic, across roughly 100 VMs per site. That's about half a GB/min, or 700GB+/day in plain-text logs from a single site's app servers alone (rough math below).
Normally that's no issue, as the data is stored in SANs and not sent to the cloud for analysis; just giving some perspective.
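For anyone who wants to sanity-check the figures, here's a quick back-of-the-envelope sketch. The ~5MB/min and ~100 VM numbers come straight from the comment above; everything else is just unit conversion:

    # Rough check of the per-site log volume quoted above.
    mb_per_min_per_vm = 5            # upper end of the 4-5 MB/min range
    vms_per_site = 100               # "around 100ish VMs/site"
    minutes_per_day = 24 * 60

    site_mb_per_min = mb_per_min_per_vm * vms_per_site          # ~500 MB/min, i.e. ~0.5 GB/min
    site_gb_per_day = site_mb_per_min * minutes_per_day / 1024  # ~700 GB/day

    print(f"{site_mb_per_min} MB/min -> {site_gb_per_day:.0f} GB/day per site")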
It's absolutely caused by exactly the things you've mentioned. I think we could easily cut it by 75% if we simply had people setting severity levels correctly and disabled storing debug logs outside of experimental environments.
90%+ of our logs are severity INFO or have no severity at all. It's like pulling teeth to even get devs to output logs in the corporate-standard json-per-line format with mandatory fields (something like the sketch below).
Still, once you're running hundreds of VMs processing a big data pipeline, it's not hard to end up with massive amounts of logs. It's not just logging, really; it's also metrics and trace information.
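For context, the json-per-line format I'm talking about is nothing exotic. Here's a minimal sketch using only the Python standard library; the field names (timestamp, severity, service, message) and the service name are illustrative, not our actual corporate schema:

    import json
    import logging
    import sys
    from datetime import datetime, timezone

    class JsonLineFormatter(logging.Formatter):
        """Render each log record as one JSON object per line (illustrative fields only)."""
        def format(self, record):
            return json.dumps({
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "severity": record.levelname,      # the field people keep leaving out
                "service": "checkout-api",         # placeholder service name
                "message": record.getMessage(),
            })

    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonLineFormatter())
    log = logging.getLogger("example")
    log.addHandler(handler)
    log.setLevel(logging.INFO)

    log.info("order placed")  # -> {"timestamp": "...", "severity": "INFO", "service": "checkout-api", "message": "order placed"}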
It is, relative to the scale AWS ES is apparently built to support.
> Amazon Elasticsearch Service lets you store up to 3 PB of data in a single cluster, enabling you to run large log analytics workloads via a single Kibana interface.
Yes, that's a small amount of data. I've worked at small companies with an order of magnitude more data per day and larger companies with three orders of magnitude more data per day flowing into a text indexing service.
We used to use ES, but after multiple issues with the index getting corrupted for various reasons, we decided to actually save money and use Algolia (before their pricing structure change; it would be more expensive now). Over the last year+ we've had no issues with search or indexing.
I guess it depends on who’s managing your ELK. I’ve been running ELK stacks in some form or another for just about 8 years; AWS’s was very poor in availability compared to what it should have been.
There were a bunch of issues, honestly: large messages got truncated, there were huge stalls in throughput (300-500ms freezes), and the fact that it was missing a bunch of features and was slower to query was just icing on the cake.
Outages during scale-out, weird insurmountable restrictions on cross-region replication, cost, backup and restore that don’t work properly, irrecoverable cluster corruption, and it’s 100% opaque whenever something goes wrong, which means escalating to AWS support, who barely understand it.
The big sell was easy deployment and cost management but the negatives outweighed that.
We tried Elastic themselves, but the sales and technical process was a nightmare. It was virtually impossible to work out what we were going to have to spend. We needed X-Pack (RBAC and SSO), but the enterprise licensing for it is all over the place.
Ergo we didn’t bother and stuck with our DBMS full-text search stuff. It’s crap but the ROI is better.
Part of the issue is the lack of visibility into what is going on; you don't get much ability to debug.
The company I was at switched from AWS's Elasticsearch to hosting it ourselves on EC2 instances so that we could control the memory/CPU/disks, see the logging, and understand more about the internals, to help figure out why it was slow for our use case.