
Big Data was whatever someone couldn't handle in a spreadsheet or on their laptop using R.

This paper is 8 years old and it was somewhat obvious then.

Scalability! But at what COST? https://www.usenix.org/system/files/conference/hotos15/hotos...

A big single machine can handle 98% of people's data reduction needs. This has always been true. Just because your laptop only has 16GB of RAM doesn't mean you need a Hadoop (or Spark, or Snowflake) cluster.

And it was always in the best interest of the Big Data vendors and cloud vendors to say "collect it all" and have you analyze it on their platform.

The future of data analysis is doing it at the point of use and incorporating it into your system directly. Your actionable insights should be ON your Grafana dashboard seconds after the event occurs.



My experience with "Big Data" is that it was something that couldn't be handled in a spreadsheet or on a laptop using R because it was so inefficiently coded.

I got sucked into "weekly key metric takes over 14 hours to run on our multi-node Kubernetes cluster" a while back. I'm not sure how many nodes it actually used, nor did I really care.

Digging into it, the Python code ingested about 50GB of various files and made well over a dozen copies of everything, leaving the whole thing extremely memory-starved. I replaced almost all of the program with some "grep | sed | awk | sed | grep" abomination that stripped out about 98% of the unnecessary info first, and it ran in under 2 minutes on my laptop. I probably should have tightened it up more, but I was more than happy to wash my hands of the whole thing by that point.

Instead of improving the code, they just kept tossing more compute at it. I still heard all kinds of grumbling about os.system('grep | sed | awk | sed | grep') not being "pythonic" and being "bad practice", but not enough that anyone actually bothered to fix it.
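
For what it's worth, the pre-filter-then-analyze shape looks roughly like this in Python (so the grumblers get something "pythonic"). The path, grep pattern, awk field numbers, and the pandas dependency are all placeholders and assumptions on my part, not the actual job:

    import io
    import subprocess

    import pandas as pd  # assumed only because the downstream step wants a DataFrame

    # Hypothetical pre-filter: strip out the ~98% of lines and columns the final
    # analysis never looks at, before anything touches Python. The path, pattern,
    # and field numbers are placeholders, not the original pipeline.
    pipeline = (
        "grep -h 'metric_of_interest' raw_logs/*.log"   # keep only relevant lines
        " | awk -F'\\t' '{print $1 \"\\t\" $5}'"        # keep only the two needed columns
    )

    filtered = subprocess.run(
        pipeline, shell=True, check=True, capture_output=True, text=True
    ).stdout

    # What reaches pandas is now a few MB, not tens of GB.
    df = pd.read_csv(io.StringIO(filtered), sep="\t", names=["timestamp", "value"])
    print(df.describe())

The point isn't the exact commands; it's that cutting the data down before the "real" analysis is what turned 14 hours into 2 minutes.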


That's one of the selling points of Hadoop: you can write garbage code and scale your way out of any problem by turning the $$$ knob up to more nodes.


Yeah, that's why I got involved (I was on the infrastructure side at the time): the Kubernetes setup they had wasn't cutting it, and the question was how we could throw more hardware at it.

One of the "data scientists" said point-blank in a meeting, "My time is too valuable to be spent optimizing the code, I should be solving problems. We can always just buy more hardware."

Admittedly the last little bit of analysis was pretty cool, but >>99% of that runtime was massaging all of the data into a format that allowed the last step to happen.


Snowflake too.

Inefficient SQL? Crank up the virtual warehouse size.
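
And the "crank" is literally one statement. A minimal sketch with the Python connector, assuming a hypothetical ANALYTICS_WH warehouse, a placeholder orders table, and placeholder credentials:

    import snowflake.connector

    # Placeholder account/credentials/warehouse -- illustrative only.
    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="...", warehouse="ANALYTICS_WH"
    )
    cur = conn.cursor()

    # The "throw money at it" fix: same query, bigger (and pricier) warehouse.
    cur.execute("ALTER WAREHOUSE ANALYTICS_WH SET WAREHOUSE_SIZE = 'XLARGE'")

    # The SQL itself is untouched -- often a scan that a better filter or
    # clustering key would have made cheap at the original size.
    cur.execute("SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id")
    rows = cur.fetchall()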


Spending more money on compute also makes you and the team look more important, and the problems being tackled more challenging.


You can do petabytes of analysis with regular old BigQuery just as easily as you can analyze megabytes of data. This solves the scalability issue for a lot of companies, IMHO.


I agree, BQ is a gem on GCP. You pay for storage (or not, you can use federated queries) and don't pay anything when you aren't using it. The ability to dynamically scale reservations is pretty nice as well.
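
A minimal sketch with the google-cloud-bigquery client, assuming a placeholder my_project.analytics.events table: the call is identical whether the table holds megabytes or petabytes, and you're billed for storage and bytes scanned rather than for an idle cluster.

    from google.cloud import bigquery

    client = bigquery.Client()  # picks up your default GCP project and credentials

    # Placeholder table and query -- the API and the SQL don't change with data
    # size; BigQuery handles the scale-out behind the scenes.
    query = """
        SELECT event_date, COUNT(*) AS events
        FROM `my_project.analytics.events`
        GROUP BY event_date
        ORDER BY event_date
    """

    for row in client.query(query).result():
        print(row.event_date, row.events)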



