
About 5 years ago I worked at a company that took the "pile of shell scripts" approach to processing data. Our data was big enough and our algorithms computationally heavy enough that a single machine wasn't a good solution. So we had a bunch of little binaries that were glued together with sed, awk, perl, and pbsnodes.

It was horrible. It was tough to maintain-- we all know how hard even the best awk and perl are to read. It was difficult to optimize, and you always found yourself worrying about things like the maximum length of a command line, how to figure out what the "real" error was in a bash pipeline, and so on. When a job failed partway through, we had to manually figure out which parts had failed and re-run them. Then we had to copy the files over to the right place to create the full final output.
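To give a sense of the plumbing involved, here is a minimal sketch of what it takes in bash just to find out which stage of a pipeline actually failed (the stage names extract_records and score_records are made up for illustration, not from our real pipeline). By default a pipeline's exit status is just the last command's, so upstream failures disappear silently:

    # fail the pipeline if any stage fails, not just the last one
    set -o pipefail

    ./extract_records input.dat | sort -k2 | ./score_records > scored.out
    statuses=("${PIPESTATUS[@]}")   # per-stage exit codes; lost after the next command runs

    for i in "${!statuses[@]}"; do
        if [ "${statuses[$i]}" -ne 0 ]; then
            echo "pipeline stage $i exited with status ${statuses[$i]}" >&2
        fi
    done

And even that tells you nothing about the partial output already written by the stages that did succeed.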

The company was a startup and the next VC milestone or pivot was always just around the corner. There was never any time to clean things up. A lot of the code had come out of early tech demos that management asked us to "just scale up." But oops, you can't do that with a pile of shell scripts and custom C binaries. So the technical debt just kept piling up. I would advise anyone in this situation not to do this. Yeah, shell scripts are great for making rough guesses about things in a pile of data. They are great for ad hoc exploration on small data or on individual log files. But that's it. Do not check them into a source code repo and don't use them in production. The moment someone tries to check in a shell script longer than a page, you need to drop the hammer. Ask them to rewrite it in a language (and ideally, a framework) that is maintainable in the long term.

Now I work on Hadoop, mostly on the storage side of things. Hadoop is many things-- a storage system, a set of computation frameworks that are robust against node failures, a Java API. But above all it's a framework for doing things in a standardized way, so that you can understand what you've done 6 months from now, and so that you can scale up by adding more nodes when your data is 2x or 4x as big down the line. On average, the customers we work with are seeing their data grow by 2x every year.

I feel like people on Hacker News often don't have a clear picture of how people interact with Hadoop. Writing MapReduce jobs is very 2008. Nowadays, more than half of our users write SQL that gets processed by an execution engine such as Hive or Impala. Most users are not developers; they're analysts. If you have needs that go beyond SQL, you would use something like Spark, which has a great and very concise API based on functional programming. Reading about how clunky MR jobs are just feels to me like reading an article about how hard it is to make boot and root floppy disks for Linux. Nobody's done that in years.
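For anyone who hasn't seen it, here's a rough sketch of what the Spark side of that looks like-- the classic word count against the PySpark RDD API (the HDFS paths are invented for the example; the SQL path is even simpler, just ordinary SELECT statements submitted to Hive or Impala):

    from pyspark import SparkContext

    sc = SparkContext(appName="word-count-example")

    # a handful of chained functional transformations replaces a
    # hand-rolled MapReduce job
    counts = (sc.textFile("hdfs:///logs/access.log")
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))

    counts.saveAsTextFile("hdfs:///logs/word_counts")

If an executor dies partway through, the framework re-runs just the lost tasks-- the failure handling we used to do by hand with the shell scripts.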



