
Once your datasets outgrow a single reasonable machine, it's time to switch to an Apache Spark cluster (or similar).

You can still write your data analysis code in Python, but you get to leverage multiple machines and an intelligent compute engine that knows how to distribute your computation across nodes automatically, keeping data lineage and parentage information, so computation is moved as close as possible to where the data is located.



You know, sometimes you are in that uncomfortable spot where you have too much data for a single laptop but too little to justify running a whole computing cluster.

That is the kind of spot where you max out everything you can max out and just go take a break when something intensive is running.


This. Honestly, depending on the task, hundreds of GB can still be in the "single computer" realm, because setting up a cluster just isn't worth the time, money, and administration overhead. However, parallel + out-of-core computation doesn't necessarily imply a cluster: single-node Spark or something like Dask works fine if you're in the Python world.
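
To make "out of core" concrete, here is a rough stdlib-only sketch of the idea, not how Dask or Spark actually implement it: stream the data in fixed-size chunks and keep only running totals in memory. The names `iter_chunks` and `mean_by_key` and the `city`/`temp` columns are made up for illustration, and this covers only the out-of-core half; those tools add the parallelism and scheduling on top.

```python
import csv
import io
from collections import defaultdict
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows, so only one chunk is in memory."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def mean_by_key(lines, key_col, value_col, chunk_size=100_000):
    """Per-key mean over a CSV stream, holding one chunk plus running totals."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    reader = csv.DictReader(lines)
    for chunk in iter_chunks(reader, chunk_size):
        for row in chunk:
            totals[row[key_col]] += float(row[value_col])
            counts[row[key_col]] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Tiny in-memory stand-in (hypothetical data) for a file too large to load at once.
sample = io.StringIO("city,temp\nA,10\nB,20\nA,30\nB,40\nA,50\n")
result = mean_by_key(sample, "city", "temp", chunk_size=2)
```

With a real file you would pass `open(path, newline="")` instead of the `StringIO` object; memory use then depends on the chunk size and the number of distinct keys, not on the file size.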


Setting up an ad hoc (aka standalone) Spark cluster with a bunch of machines you control is a ridiculously trivial task, though. It's essentially as easy as running spark --master=x, where you designate one machine as the master. All the others started with --master=x become workers of x. Then you just submit jobs to x and that's all.


Spark is slow though. On the other hand, Pandas is also extraordinarily slow :D


Then you remote into a workstation, as someone else in this thread said they did.


Running distributed like that always has a cost, both in inefficiency of the compute and in person-time.

If you can still run on one machine, it's almost always a win. 32 GB is a perfectly reasonable amount of memory to expect. 64 GB isn't outlandish at all for a workstation.


It really depends on the computation. What you say only makes sense for some niches of computation.



