Yeah. A bit annoying to build a system that handles 5,000 events/sec (probably much more at peak) only to throw away 90% of them, though. Obviously you don't have an SLA on consumption time, but I imagine you can't lag too far behind, assuming it's Kafka or something.
Genuine question: we signed up for the deci/decahose last month and for now are making daily dumps of about 50 GB in flat files.
What kind of system architecture would be good for loading and searching through such a dataset? We tried exploring Mongo Atlas, but it is coming out very expensive. The other alternative is to throw away most of the metadata and keep just the ID and the tweet.
The access comes at a cost and is pretty expensive unless covered by academic grants (it's about 10 grand USD). How expensive would ClickHouse be for data growing by about 1-2 TB per month? We are trying to optimize costs, as the Twitter purchase already seems steep.
If you use S3 as storage for cold data, it can come down to just the S3 (or other object storage) cost. And ClickHouse compression ratios for Twitter-like data can be 10x or better, since you do not store the raw data and the data is stored in columns.
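To put rough numbers on that, here is a back-of-envelope sketch. It takes the 1-2 TB/month growth and the 10x compression figure from this thread at face value; the per-GB price is a placeholder I made up for illustration, so plug in your provider's actual rate. It covers storage only and ignores compute, request, and egress charges.

```python
# Back-of-envelope storage estimate. Every input here is an assumption.
RAW_TB_PER_MONTH = (1.0, 2.0)   # raw growth range mentioned in the thread
COMPRESSION_RATIO = 10.0        # the "10x or better" figure, taken at face value
PRICE_PER_GB_MONTH = 0.023      # placeholder object-storage price in USD, not a quote


def stored_gb(raw_tb_per_month: float, months: int) -> float:
    """Compressed data held after `months` of steady growth, in GB."""
    return raw_tb_per_month * 1000.0 * months / COMPRESSION_RATIO


def monthly_bill(raw_tb_per_month: float, months: int) -> float:
    """Storage-only bill in USD for the month in which `months` of data are held."""
    return stored_gb(raw_tb_per_month, months) * PRICE_PER_GB_MONTH


if __name__ == "__main__":
    for raw in RAW_TB_PER_MONTH:
        for months in (1, 12):
            print(
                f"{raw:.0f} TB/month raw, after {months:2d} months: "
                f"{stored_gb(raw, months):6.0f} GB stored, "
                f"~${monthly_bill(raw, months):.2f}/month"
            )
```

If the 10x figure holds for your data, the storage line item stays small next to the cost of the feed itself; the real unknown is how much compute you need for queries.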
The example tweet, a somewhat large one with links and images, gzips down to 1.5KB by itself. Given that information, I think it would be reasonable to estimate 1KB per tweet in bulk.
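A quick way to check the "smaller in bulk" intuition on your own dumps is to compare gzip applied to each record separately against gzip applied to the whole batch. The sketch below uses synthetic records with a made-up shape (`make_fake_tweet` is hypothetical and far simpler than a real tweet payload), so only the direction of the result carries over, not the exact numbers.

```python
import gzip
import json
import random


def make_fake_tweet(i: int, rng: random.Random) -> dict:
    """Build a synthetic record. Hypothetical shape, not the real tweet schema."""
    words = ["data", "stream", "kafka", "storage", "tweet", "cost", "search", "index"]
    return {
        "id": 1_500_000_000_000_000_000 + i,
        "created_at": "2022-05-01T00:00:00Z",
        "lang": "en",
        "user": {"id": rng.randrange(10**9), "followers_count": rng.randrange(10**5)},
        "text": " ".join(rng.choice(words) for _ in range(20)),
        "entities": {"urls": [{"expanded_url": f"https://example.com/{i}"}]},
    }


def compressed_sizes(records: list) -> tuple:
    """Return (avg bytes per record gzipped one by one, avg bytes gzipped as one batch)."""
    lines = [json.dumps(r).encode("utf-8") for r in records]
    one_by_one = sum(len(gzip.compress(line)) for line in lines)
    in_bulk = len(gzip.compress(b"\n".join(lines)))
    return one_by_one / len(lines), in_bulk / len(lines)


if __name__ == "__main__":
    rng = random.Random(0)
    records = [make_fake_tweet(i, rng) for i in range(1000)]
    single, bulk = compressed_sizes(records)
    print(f"gzipped one by one: {single:.0f} bytes per record")
    print(f"gzipped in bulk:    {bulk:.0f} bytes per record")
```

Bulk compression wins because the repeated JSON keys and common values are shared across records instead of being paid for in every single one. To get a real estimate, swap the synthetic records for a few thousand lines from an actual dump.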
No it isn't.
It's not about storage but about analysing the data.
It's text, so it could easily be compressed to a smaller size, but this doesn't help in knowing anything about the tweet author, whether he is a bot or not.
That's like saying it's easy to check if a picture shows a bird because it's only 100kB.
Musk has one of the top AI researchers of his generation working for him (Karpathy). In general, he is rich enough to hire anyone he wants for consultation.
I’m boggled by the extent to which people mind-read Musk through his tweets. To me it’s obvious that his Twitter is an outlet for stress busting and banter, like smoke breaks.
If you have the firehose, couldn't you just sample every 10th tweet and make a "decahose"[1] of your own to work with?
[1] Shouldn't this be "decihose" since it's 1/10 not 10x?
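For what it's worth, the sampling itself is only a few lines. Below is a minimal sketch of two ways to do it, assuming the stream is any Python iterable and each tweet is a dict with an `id` key (both are my assumptions, not a description of the real feed). Counting every 10th item is the literal version; hashing the ID gives the same 1-in-10 rate but makes the choice stable across restarts and across multiple consumers.

```python
import hashlib
from typing import Iterable, Iterator


def every_nth(stream: Iterable, n: int = 10) -> Iterator:
    """Keep items 0, n, 2n, ... by position in the stream."""
    for i, item in enumerate(stream):
        if i % n == 0:
            yield item


def sample_by_id(stream: Iterable[dict], n: int = 10) -> Iterator[dict]:
    """Keep a tweet when a hash of its ID lands in bucket 0 of n.

    The decision depends only on the ID, so the same tweets are kept
    regardless of where the consumer started reading.
    """
    for tweet in stream:
        digest = hashlib.sha1(str(tweet["id"]).encode("utf-8")).digest()
        if int.from_bytes(digest[:8], "big") % n == 0:
            yield tweet


if __name__ == "__main__":
    fake_stream = [{"id": i, "text": f"tweet {i}"} for i in range(10_000)]
    print(len(list(every_nth(fake_stream))), "kept by position")
    print(len(list(sample_by_id(fake_stream))), "kept by ID hash")
```

Either way you still have to consume and parse the full stream before dropping 90% of it, so this saves on storage rather than on ingest capacity.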