Bottom line is - you do not need hadoop until you cross 2TB of data to be proces...

earino · on Jan 19, 2015

Dell sells a server with 6TB of RAM (I believe). I think the limit is way over 2TB. If you want to be able to query it quickly for analytical workloads, MPPs like Vertica scale up to 150+TB (at Facebook). I honestly don't know at what scale you need Hadoop, but it's gotten to be a large number very quickly.

juliangregorian · on Jan 19, 2015

They do, I checked. It comes in at a cool half million (Helloooo, investors!)

virmundi · on Jan 19, 2015

My question is what do you mean by 2TB? At my current client, we have 5 TB of data sitting (that's relatively recent). Before that we had 2-ish. However, we had over 30 applications doing complex fraud calculations on that. "Moving data" (data being read and then worked on) is about 40 TB daily. Even with SSD and 256 GB of RAM, a single machine would get overwhelmed by this.
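For a rough sense of scale, the 40 TB daily figure can be turned into an average sustained read rate. This is only a back-of-the-envelope sketch: it assumes decimal terabytes and a perfectly even load across the day, and it says nothing about peaks or about the CPU cost of the fraud calculations themselves.

```python
# Assumption: decimal units (1 TB = 10**12 bytes) and an even load over 24 hours.
TB = 10**12
SECONDS_PER_DAY = 24 * 60 * 60


def average_throughput_mb_s(bytes_per_day):
    """Average sustained rate in MB/s for a given daily volume in bytes."""
    return bytes_per_day / SECONDS_PER_DAY / 10**6


rate = average_throughput_mb_s(40 * TB)
print(f"{rate:.0f} MB/s")  # about 463 MB/s on average
```

Real traffic is bursty, so the peak rate would be higher than this average.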

If you're only working one app on less than 1 TB, maybe you don't need something as complex as Hadoop. But given that a cluster is easy to set up (I made a really simple NameNode + two DataNodes in 45 minutes, going in cold), it might not be a bad idea.

I'll take this further and say that some tools for Hadoop that are not from Apache are really nice to work with even for non-Hadoop work. For example, I've got to join several 1 GB files together to go from a relational, CSV model into a document store model. Can I do this with command line tools? Maybe. Cascading makes this really easy. Each file family is a tap. I get tuple joins naturally. I wrote an ArangoDB tap to auto-load into ArangoDB. It was fun, testable and easy. All of this runs sans-Hadoop on my little MBP.
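To make the relational-to-document step concrete, here is a minimal sketch in plain Python rather than Cascading. It is not the commenter's code: the file contents, the column names, and the `to_documents` helper are all hypothetical, and it groups child rows in memory, which is fine for small inputs but would need care at the 1 GB scale.

```python
import csv
import io
import json
from collections import defaultdict


def rows(text):
    """Parse CSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def to_documents(parents, children, key, field):
    """Nest each child row under the parent row that shares `key`."""
    grouped = defaultdict(list)
    for child in children:
        # Drop the join key from the child; the parent already carries it.
        grouped[child[key]].append({k: v for k, v in child.items() if k != key})
    return [{**parent, field: grouped.get(parent[key], [])} for parent in parents]


# Hypothetical sample data standing in for two of the CSV files.
CUSTOMERS = "customer_id,name\n1,Ada\n2,Grace\n"
ORDERS = "order_id,customer_id,total\n10,1,9.50\n11,1,20.00\n12,2,5.25\n"

docs = to_documents(rows(CUSTOMERS), rows(ORDERS), key="customer_id", field="orders")
print(json.dumps(docs, indent=2))
```

Each resulting dict is one document, ready to hand to whatever loader the document store provides.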

Fun fact about the Cascading tool set: I can take my little app from my desktop and plop it onto a Hadoop cluster with little change (swapping the taps from local to Hadoop). Will I do that in my present example? No. Can I think of places where that's really useful? Yes: regression tests for 35 fraud models, run daily with each build. That's somewhere around 500 full model executions over limited but meaningful data. All easily done courtesy of a framework that targets Hadoop.

treve · on Jan 18, 2015

What makes 2TB the cutoff?

wobbleblob · on Jan 19, 2015

I think the consensus is that as long as your data fits on a single (affordable) machine, 'big data' tools are probably not the best solution.