> In truth, Musk probably only needs the “decahose” API Twitter makes available ...

kareemsabri · on June 10, 2022

Yeah. A bit annoying to build a system that handles 5,000 events/sec (probably much more at peak) only to throw away 90% of them, though. Obviously you don't have an SLA on consumption time, but I imagine you can't lag too far behind, assuming it's Kafka or something.

srvmshr · on June 10, 2022

Genuine question: we signed up for the deci/decahose last month and for now are making daily dumps of about 50GB in flat files.

What kind of system architecture would be good for loading & searching through such a dataset? We tried to explore Mongo Atlas but it is coming out very expensive. The other alternative is to throw away most of the metadata & just keep the ID & tweet.
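
A rough sketch of that last option, assuming the daily dumps are newline-delimited JSON and that the fields to keep are named `id` and `text` (both are assumptions about the dump format, and the helper name `slim_tweets` is made up for illustration):

```python
import json
from typing import Iterable, Iterator, Sequence


def slim_tweets(lines: Iterable[str], keep: Sequence[str] = ("id", "text")) -> Iterator[str]:
    """Yield one compact JSON string per tweet, keeping only the `keep` fields.

    Assumes one JSON object per input line; blank lines are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)
        # Drop every field that is not explicitly kept.
        slim = {key: tweet[key] for key in keep if key in tweet}
        yield json.dumps(slim, ensure_ascii=False)


if __name__ == "__main__":
    sample = [
        '{"id": "1", "text": "hello", "lang": "en", "source": "web"}',
        "",
        '{"id": "2", "text": "world"}',
    ]
    for row in slim_tweets(sample):
        print(row)
```

Since it works on any iterable of lines, the same function could be fed an open file handle so a 50GB dump is never held in memory at once.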

qoega · on June 10, 2022

Look at ClickHouse. I think you will be pleasantly surprised by how fast and effective it is.

By the way, do you have to pay for this access, or are there other options available? It would be interesting to ingest all this data ourselves.

srvmshr · on June 10, 2022

The access comes at a cost and is pretty expensive unless covered by some academic grants. (It's about 10 grand USD.) How expensive would ClickHouse be for data growing by about 1-2 TB per month? We are trying to optimize costs, as the Twitter purchase already seems steep.

qoega · on June 10, 2022

If you use S3 as storage for cold data, the cost can come down to just the S3 (or other object storage) bill. And ClickHouse compression ratios for Twitter-like data can be 10x or better, since you do not store the raw data and the data is stored in columns.
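
A back-of-envelope sketch of what that means for 1-2 TB of raw data per month, assuming the 10x compression ratio mentioned above and a placeholder object-storage price of $0.023 per GB-month (the price is an assumption for illustration, not a quote; check your provider's current pricing). Request, compute, and egress costs are ignored:

```python
def stored_gb_after(months: int, raw_tb_per_month: float, compression_ratio: float = 10.0) -> float:
    """Compressed data accumulated after `months`, in GB (1 TB = 1000 GB)."""
    return months * raw_tb_per_month * 1000 / compression_ratio


def monthly_storage_cost_usd(stored_gb: float, usd_per_gb_month: float = 0.023) -> float:
    """Monthly bill for keeping `stored_gb` in object storage at an assumed flat price."""
    return stored_gb * usd_per_gb_month


if __name__ == "__main__":
    for raw_tb in (1.0, 2.0):
        gb = stored_gb_after(12, raw_tb)
        cost = monthly_storage_cost_usd(gb)
        print(f"{raw_tb:.0f} TB/month raw -> {gb:.0f} GB stored after a year, ~${cost:.2f}/month")
```

Under these assumptions, a year of data costs tens of dollars per month to keep, so the storage bill is unlikely to be the dominant cost next to the data access itself.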

As for ingestion costs, I think it took me ~10 hours on a 4 vCPU VM the last time I loaded 1.5 TB of data into ClickHouse from this dataset: http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-ta... (the 1.1 billion rides have since grown to 3.5 billion).

srvmshr · on June 10, 2022

Interesting. Would you mind talking offline (email in my bio)? Would love to get some pointers.

Nextgrid · on June 10, 2022

If you don't mind a proprietary hosted service, try Google BigQuery.

pram · on June 10, 2022

Typical use case for Presto or BigQuery. Putting it in S3 and using Athena wouldn't take very long.

nojito · on June 10, 2022

ClickHouse will churn through it.

webo · on June 10, 2022

>If the average length of a tweet (minus headers) is ~100 bytes of text, that’s only 50GB. You could fit it on a USB stick.

https://twitter.com/elonmusk/status/1534939289653592065

ceejayoz · on June 10, 2022

You can't do bot detection with just the tweet text.

Tweets come with a lot more data than that.

Tweet object model: https://developer.twitter.com/en/docs/twitter-api/data-dicti...

Example payloads: https://developer.twitter.com/en/docs/twitter-api/data-dicti...

webo · on June 10, 2022

Yeah, I was just highlighting how ignorant Musk is (sarcastically). This must’ve been what he meant by “hardcore software engineering.”

Dylan16807 · on June 10, 2022

The example tweet, a somewhat large one with links and images, gzips down to 1.5KB by itself. Given that information, I think it would be reasonable to estimate 1KB per tweet in bulk.

500 million times 1KB is... still one USB stick, not even $50 https://amazon.com/SanDisk-512GB-Ultra-Flash-Drive/dp/B083ZL...

Or going by the retweeted quote tweet that's 40% bigger zipped, it's still well within 1TB which is a normal size for a flash drive.
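
The arithmetic, as a quick sketch. It assumes 500 million tweets per day, decimal units (1 KB = 1000 bytes), and that the "40% bigger" case scales the 1KB bulk estimate to 1.4KB, which is my reading rather than a measured number:

```python
TWEETS_PER_DAY = 500_000_000  # assumed daily volume, matching the 50GB-at-100-bytes figure


def daily_volume_gb(bytes_per_tweet: float, tweets: int = TWEETS_PER_DAY) -> float:
    """Daily data volume in GB (1 GB = 1e9 bytes) for a given average tweet size."""
    return tweets * bytes_per_tweet / 1e9


if __name__ == "__main__":
    cases = {
        "text only, ~100 bytes": 100,
        "full object, ~1KB compressed": 1000,
        "40% bigger, ~1.4KB compressed": 1400,
    }
    for label, size in cases.items():
        print(f"{label}: {daily_volume_gb(size):.0f} GB/day")
```

That works out to 50 GB, 500 GB, and 700 GB per day respectively, so all three cases stay under 1TB.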

I'm not very fond of Musk but this is a weak criticism.

croes · on June 10, 2022

No, it isn't. It's not about storage but about analysing the data. It's text, so it could easily be compressed to a smaller size, but that doesn't help you learn anything about the tweet's author, such as whether he is a bot or not.

That's like saying it's easy to check if a picture shows a bird because it's only 100kB.

Dylan16807 · on June 10, 2022

The tweet he replied to sure seemed to be about volume of data to me.

jmeister · on June 10, 2022

Musk has one of the top AI researchers of his generation working for him (Karpathy). In general, he is rich enough to hire anyone he wants for consultation.

I’m boggled by the extent to which people mind-read Musk through his tweets. To me it’s obvious that his Twitter is an outlet for stress busting and banter, like smoke breaks.

dubswithus · on June 10, 2022

Karpathy is on sabbatical and rumored to be interviewing elsewhere.

naveen99 · on June 10, 2022

Hmm… with general AI around the corner, he should be getting paid more than Steph Curry.

defterGoose · on June 10, 2022

Re: your question, the answer is yes.