Hacker News

This. Honestly, depending on the task, hundreds of GB can still be in the "single computer" realm, because setting up a cluster just isn't worth it in terms of time, money, and administration overhead. That said, parallel + out-of-core computation doesn't necessarily imply a cluster: single-node Spark or something like Dask works fine if you're in the Python world.
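As a minimal sketch of the out-of-core idea on a single machine: pandas can stream a CSV in fixed-size chunks and fold partial aggregates together, so peak memory is bounded by the chunk size rather than the file size. The file name and column names here are made up for illustration.

```python
import csv
import os
import tempfile

import pandas as pd

# Build a small stand-in file (a real workload would be hundreds of GB).
path = os.path.join(tempfile.mkdtemp(), "events.csv")
with open(path, "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["user", "amount"])
    for i in range(10_000):
        w.writerow([f"u{i % 7}", i])

# Stream the file in chunks and combine the partial group sums;
# only one chunk is ever resident in memory at a time.
partials = []
for chunk in pd.read_csv(path, chunksize=1_000):
    partials.append(chunk.groupby("user")["amount"].sum())
result = pd.concat(partials).groupby(level=0).sum()

print(result)
```

Dask's DataFrame API automates exactly this chunk-and-combine pattern behind a pandas-like interface, which is why it scales past RAM without any cluster at all.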


Setting up an ad hoc (aka standalone) Spark cluster with a bunch of machines you have control over is a ridiculously trivial task, though. You designate one machine as the master, start the others as workers pointed at it, and then just submit jobs to the master. That's all.
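Roughly, the standalone setup looks like this, assuming Spark is unpacked at `$SPARK_HOME` on every machine; the master hostname `node0` and the job file `my_job.py` are hypothetical placeholders:

```shell
# On the machine designated as master:
$SPARK_HOME/sbin/start-master.sh
# The master now listens at spark://node0:7077 (and serves a web UI on :8080).

# On every other machine, start a worker pointed at the master
# (the script was called start-slave.sh before Spark 3.x):
$SPARK_HOME/sbin/start-worker.sh spark://node0:7077

# Then submit jobs to the master from anywhere:
$SPARK_HOME/bin/spark-submit --master spark://node0:7077 my_job.py
```

No resource manager like YARN or Kubernetes is involved; the standalone master does the scheduling itself.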


Spark is slow, though. On the other hand, Pandas is also extraordinarily slow :D



