
> In truth, Musk probably only needs the “decahose” API Twitter makes available to some researchers, which is 10% of all tweets.

If you have the firehose, couldn't you just sample every 10th tweet and make a "decahose"[1] of your own to work with?

[1] Shouldn't this be "decihose" since it's 1/10 not 10x?
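For what it's worth, the "keep every 10th tweet" idea is a one-liner over any iterable stream. This is only a rough sketch: `every_nth` is a made-up helper name, and the integers below stand in for tweets, since nothing here knows how the real firehose is delivered.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def every_nth(stream: Iterable[T], n: int = 10) -> Iterator[T]:
    """Lazily yield the n-th, 2n-th, 3n-th, ... item of a stream.

    Hypothetical helper for illustration; the stream can be any iterable,
    including an unbounded generator, because islice never materializes it.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    return islice(stream, n - 1, None, n)


# Stand-in for a firehose: tweets numbered 1..30.
sample = list(every_nth(range(1, 31), 10))
print(sample)
```

You still have to receive and touch all 100% of the events to do this; the sampling only reduces what you store and analyse afterwards.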



Yeah. A bit annoying to build a system that handles 5,000 events/sec (probably much more at peak) just to throw away 90% of them, though. Obviously you don't have an SLA on consumption time, but I imagine you can't lag too far behind, assuming it's Kafka or something.


Genuine question: we signed up for the deci/decahose last month and for now are making daily dumps of about 50GB to flat files.

What kind of system architecture would be good for loading & searching through such a dataset? We tried exploring Mongo Atlas, but it is coming out very expensive. The other alternative is to throw away most of the metadata & keep just the ID & tweet.


Look at ClickHouse. I think you will be pleasantly surprised by how fast and effective it is.

By the way, do you have to pay for this access, or are there other options available? It would be interesting to ingest all this data ourselves.


The access comes at a cost and is pretty expensive unless covered by an academic grant (it's about 10 grand USD). How expensive would ClickHouse be for data growing by about 1-2 TB per month? We are trying to optimize costs, as the Twitter purchase already seems steep.


If you use S3 as storage for cold data, it can come down to just the S3 (or other object storage) cost. And ClickHouse compression rates for Twitter-like data can be 10x or better, since you do not store the raw data and the data is stored in columns.

Concerning ingestion costs, I think it took me ~10 hours on a 4 vCPU VM the last time I loaded 1.5 TB of data into ClickHouse from this dataset: http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-ta... The 1.1 billion rides have become 3.5 billion by now.


Interesting. Would you mind talking offline (email in my bio)? Would love to get some pointers.


If you don't mind a proprietary hosted service, try Google BigQuery.


Typical use case for Presto or BigQuery. Putting it in S3 and using Athena wouldn't take very long.


ClickHouse will churn through it.


>If the average length of a tweet (minus headers) is ~100 bytes of text, that’s only 50GB. You could fit it on a USB stick.

https://twitter.com/elonmusk/status/1534939289653592065


You can't do bot detection with just the tweet text.

Tweets come with a lot more data than that.

Tweet object model: https://developer.twitter.com/en/docs/twitter-api/data-dicti...

Example payloads: https://developer.twitter.com/en/docs/twitter-api/data-dicti...


Yeah, I was just highlighting how ignorant Musk is (sarcastically). This must’ve been what he meant by “hardcore software engineering.”


The example tweet, a somewhat large one with links and images, gzips down to 1.5KB by itself. Given that information, I think it would be reasonable to estimate 1KB per tweet in bulk.

500 million times 1KB is... still one USB stick, not even $50 https://amazon.com/SanDisk-512GB-Ultra-Flash-Drive/dp/B083ZL...

Or going by the retweeted quote tweet that's 40% bigger zipped, it's still well within 1TB which is a normal size for a flash drive.

I'm not very fond of Musk but this is a weak criticism.
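To make the back-of-envelope numbers in this subthread easy to check, here is a small sketch. It assumes the 500 million tweets/day figure used above, decimal units (1KB = 1000 bytes), and reads "40% bigger" as roughly 1.4KB per tweet; `daily_volume_gb` is just an illustrative name.

```python
TWEETS_PER_DAY = 500_000_000  # figure used in the thread, treated as an assumption


def daily_volume_gb(bytes_per_tweet: float, tweets: int = TWEETS_PER_DAY) -> float:
    """Return the estimated daily volume in decimal gigabytes (1 GB = 1e9 bytes)."""
    return tweets * bytes_per_tweet / 1e9


# Per-tweet sizes discussed above; the 1.4KB case is an interpretation of "40% bigger".
for label, size in [
    ("text only, ~100 bytes", 100),
    ("gzipped with metadata, ~1KB", 1_000),
    ("gzipped quote tweet, ~1.4KB", 1_400),
]:
    print(f"{label}: {daily_volume_gb(size):.0f} GB/day")
```

That is one day's worth in each case, so it says something about storage and nothing about how hard the analysis is.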


No, it isn't. It's not about storage but about analysing the data. It's text, so it could easily be compressed to a smaller size, but that doesn't help in knowing anything about the tweet's author, such as whether he is a bot or not.

That's like saying it's easy to check if a picture shows a bird because it's only 100kB.


The tweet he replied to sure seemed to be about volume of data to me.


Musk has one of the top AI researchers of his generation working for him (Karpathy). In general, he is rich enough to hire anyone he wants for consultation.

I’m boggled by the extent to which people mind-read Musk through his tweets. To me it’s obvious that his Twitter is an outlet for stress busting and banter, like smoke breaks.


Karpathy is on sabbatical and rumored to be interviewing elsewhere.


Hmm… with general AI around the corner, he should be getting paid more than Steph Curry.


Re: your question, the answer is yes.



