
Big Data was whatever someone couldn't handle in a spreadsheet or on their laptop using R.

This paper is 8 years old and it was somewhat obvious then.

Scalability! But at what COST? https://www.usenix.org/system/files/conference/hotos15/hotos...

A big single machine can handle 98% of people's data reduction needs. This has always been true. Just because your laptop only has 16GB of RAM doesn't mean you need a Hadoop (or Spark, or Snowflake) cluster.

And it was always in the best interest of the Big Data vendors and cloud vendors to say "collect it all" and have you analyze it on their platform.

The future of data analysis is doing it at the point of use and incorporating it into your system directly. Your actionable insights should be ON your Grafana dashboard seconds after the event occurs.



My experience with "Big Data" is that it was something that couldn't be handled in a spreadsheet or on a laptop using R because it was so inefficiently coded.

I got sucked into "weekly key metric takes over 14 hours to run on our multi-node Kubernetes cluster" a while back. I'm not sure how many nodes it actually used, nor did I really care.

Digging into it, the Python code ingested about 50GB of various files and made well over a dozen copies of everything, leaving the whole thing extremely memory-starved. I replaced almost all of the program with some "grep | sed | awk | sed | grep" abomination that stripped out about 98% of the unnecessary info first, and it ran in under 2 minutes on my laptop. I probably should have tightened it up more, but I was more than happy to wash my hands of the whole thing by that point.

Instead of improving the code, they just kept tossing more compute at it. I still heard all kinds of grumbling about os.system('grep | sed | awk | sed | grep') not being "pythonic" and being "bad practice", but not enough that anyone actually bothered to fix it.
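
For what it's worth, the pre-filter-then-analyze shape looks roughly like this in Python (so the grumblers get something "pythonic"). The path, grep pattern, awk field numbers, and the pandas dependency are all placeholders and assumptions on my part, not the actual job:

    import io
    import subprocess

    import pandas as pd  # assumed only because the downstream step wants a DataFrame

    # Hypothetical pre-filter: strip out the ~98% of lines and columns the final
    # analysis never looks at, before anything touches Python. The path, pattern,
    # and field numbers are placeholders, not the original pipeline.
    pipeline = (
        "grep -h 'metric_of_interest' raw_logs/*.log"   # keep only relevant lines
        " | awk -F'\\t' '{print $1 \"\\t\" $5}'"        # keep only the two needed columns
    )

    filtered = subprocess.run(
        pipeline, shell=True, check=True, capture_output=True, text=True
    ).stdout

    # What reaches pandas is now a few MB, not tens of GB.
    df = pd.read_csv(io.StringIO(filtered), sep="\t", names=["timestamp", "value"])
    print(df.describe())

The point isn't the exact commands; it's that cutting the data down before the "real" analysis is what turned 14 hours into 2 minutes.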


That's one of the selling points of Hadoop: you can write garbage code and scale your way out of any problem by turning the $$$ knob up to more nodes.


Yeah, that's why I got involved (I was on the infrastructure side at the time): the Kubernetes setup they had wasn't cutting it, and the question was how we could throw more hardware at it.

One of the "data scientists" said point-blank in a meeting, "My time is too valuable to be spent optimizing the code, I should be solving problems. We can always just buy more hardware."

Admittedly the last little bit of analysis was pretty cool, but >>99% of that runtime was massaging all of the data into a format that allowed the last step to happen.


Snowflake too.

Inefficient SQL? Crank up the virtual warehouse size.
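
And the "crank" is literally one statement. A minimal sketch with the Python connector, assuming a hypothetical ANALYTICS_WH warehouse, a placeholder orders table, and placeholder credentials:

    import snowflake.connector

    # Placeholder account/credentials/warehouse -- illustrative only.
    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="...", warehouse="ANALYTICS_WH"
    )
    cur = conn.cursor()

    # The "throw money at it" fix: same query, bigger (and pricier) warehouse.
    cur.execute("ALTER WAREHOUSE ANALYTICS_WH SET WAREHOUSE_SIZE = 'XLARGE'")

    # The SQL itself is untouched -- often a scan that a better filter or
    # clustering key would have made cheap at the original size.
    cur.execute("SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id")
    rows = cur.fetchall()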


Spending more money on compute also makes you and the team look more important, and the problems being tackled more challenging.


You can do petabytes of analysis with regular old BigQuery just as easily as you can analyze megabytes of data. This solves the scalability issue for a lot of companies, IMHO.


I agree, BQ is a gem on GCP. You pay for storage (or not, you can use federated queries) and don't pay anything when you aren't using it. The ability to dynamically scale reservations is pretty nice as well.
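
A minimal sketch with the google-cloud-bigquery client, assuming a placeholder my_project.analytics.events table: the call is identical whether the table holds megabytes or petabytes, and you're billed for storage and bytes scanned rather than for an idle cluster.

    from google.cloud import bigquery

    client = bigquery.Client()  # picks up your default GCP project and credentials

    # Placeholder table and query -- the API and the SQL don't change with data
    # size; BigQuery handles the scale-out behind the scenes.
    query = """
        SELECT event_date, COUNT(*) AS events
        FROM `my_project.analytics.events`
        GROUP BY event_date
        ORDER BY event_date
    """

    for row in client.query(query).result():
        print(row.event_date, row.events)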



