How to use R, H2O, and Domino for a Kaggle competition (dominoup.com)
28 points by earino on Sept 22, 2014 | 12 comments


I do not understand startups like Domino. It seems to me like it is essentially the equivalent of running an AWS instance along with a GitHub account. AWS does not require any sort of hardware maintenance on your part, and it takes only a tutorial or two to learn how to install R and run code on it in parallel / across multiple instances.

Presumably, Domino does not take unparallelized R code and transform it into parallelized code - so if you have to use the parallel R package (or some equivalent) anyway to get it to run on multiple cores, what really is the value add? Am I missing something?
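
(The kind of boilerplate I mean - a minimal sketch with base R's parallel package on toy built-in data, not anything Domino-specific:)

    library(parallel)

    cl <- makeCluster(detectCores() - 1)   # one worker per core, minus one
    # fit the same model on 100 bootstrap resamples, spread across the workers
    boot_coefs <- parLapply(cl, 1:100, function(i) {
      idx <- sample(nrow(mtcars), replace = TRUE)
      coef(lm(mpg ~ wt + hp, data = mtcars[idx, ]))
    })
    stopCluster(cl)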


Hi izyda. We get that question a lot, and we're working on our messaging around this, so I appreciate the feedback. Here are some reasons our customers find Domino valuable:

- Domino makes it really easy to start and manage multiple runs in parallel (think a modern, easy-to-use cluster). If you're doing all this directly with AWS, you quickly run into pain points managing all your instances and images.

- Domino automatically keeps a revisioned history of your work. It supports large files like data sets (which git can't handle) and it tracks the results/artifacts of your analysis (which makes it more like git + CI). These things are critical to analytics workflows in a way they aren't to pure software development.

- Domino lets you deploy your analyses as self-service web UI tools, or deploy them to API endpoints. Doing this on your own would involve building an entire web stack around your analysis.

- Domino hosts your analysis centrally so you can share and collaborate with others (yes, this is like github, but on a platform that has all the benefits above).

- The entire product can be installed on-premise, so companies can use the functionality described above without going to the cloud if they don't want to.

Finally, even for pure infrastructure management, we've found that many data scientists don't want to spend their time dealing with system administration. It's true that it's not that hard to start an EC2 instance. But pretty quickly you're installing packages (perhaps in an environment you aren't used to), dealing with security groups, file transfer (configuring S3), etc. People use Domino for the same reason they use Heroku: yes, you could deal with all that, but it might be a better use of your time to let someone else do it.


Thanks for the response - there are some fair points here.

As other commenters pointed out, the fact that you charge by the minute, not by the hour, does in fact make a big difference in price, particularly for those of us who need to run intensive but sporadic/short tasks.

A few questions about your points that I am trying to find in the documentation right now, but perhaps you can save me the trouble if you happen to see this first:

> Domino makes it really easy to start and manage multiple runs in parallel (think a modern, easy-to-use cluster). If you're doing all this directly with AWS, you're quickly running into pain points managing all your instances and images.

How so? Does Domino allow you to spin up more cores at will from R? That would be awesome.

> - Domino lets you deploy your analyses as self-service web UI tools, or deploy them to API endpoints. Doing this on your own would involve building an entire web stack around your analysis.

This is awesome and definitely useful if you are doing work for clients and do not want to be bothered with spending too much time building production-grade stuff. In some sense, is this like yhathq.com? (I understand you guys do more than they do, in the sense that you provide all these other features.)


One nice feature is that Domino charges by the minute whereas AWS charges by the hour, so if you want to run a 32-CPU job for 5 minutes, it only costs you 30c. Domino's prices are 2X AWS per minute for this reason. I have tried Domino out, and it's pretty nice, but I'm not sure if I will end up using it long term. Having 32 CPUs is not quite enough benefit over my 8-core laptop, and as you say, more complex parallelization requires you to do the heavy lifting yourself.
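
Rough math - the AWS hourly rate below is my assumption for a 32-vCPU instance (roughly a 2014 c3.8xlarge); the 2X and 30c figures are as above:

    aws_hourly    <- 1.68                 # assumed 32-vCPU on-demand price
    domino_hourly <- 2 * aws_hourly       # "2X AWS", but billed per minute
    minutes       <- 5

    domino_cost <- domino_hourly / 60 * minutes        # ~$0.28 for the 5-minute run
    aws_cost    <- aws_hourly * ceiling(minutes / 60)  # full hour billed -> $1.68
    c(domino = domino_cost, aws = aws_cost)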


I think it's similar to why people don't just set up an AWS box and cron rsync instead of using Dropbox. Or the same benefit as using Heroku over AWS directly. If you're not deeply familiar with server administration, there are a hundred or so snags in managing synchronization of code and data between systems, getting the remote system set up to run the same code as your local system, and managing all the runs and results.


I've used both Domino and AWS.

AWS isn't hard to use, but Domino is much easier and more streamlined for analytics.

My recollection is also that AWS rounds charges up to the hour, whereas Domino does not.

I didn't know spinning up analysis in the cloud could be so easy and frictionless before trying Domino.


Can you please comment on why you need a 50-node, 3-hidden-layer FFNN to do regression, as opposed to something simpler?


The starter code I provided is a basic DNN structure for modelling complex non-linear relationships between five soil properties and 3000+ predictors.

In practice, I found that some of the properties require an even more complex DNN structure to achieve better predictive accuracy. The 50-50-50 setup is a very solid starting point for readers to begin their own experiments.
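
Roughly what a 50-50-50 h2o.deeplearning call looks like - the path and target column below are placeholders, not the exact starter code:

    library(h2o)
    h2o.init(nthreads = -1)                       # use all available cores

    train <- h2o.importFile("training.csv")       # placeholder path
    predictors <- setdiff(colnames(train), "Ca")  # "Ca" stands in for one soil property

    model <- h2o.deeplearning(
      x = predictors,
      y = "Ca",
      training_frame = train,
      hidden = c(50, 50, 50),   # three hidden layers of 50 neurons each
      epochs = 100
    )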


Thank you. How did you come up with the 50-50-50 setup, or was it purely empirical? Did you try something simpler first, and how did that simpler method perform vis-a-vis this DNN? Congratulations on topping the leaderboard.


Thanks! Yes, I always start with much simpler networks like 10, 10-10, 10-10-10. Unfortunately, the regression problems here are quite complex, hence bigger networks are required (well, it wouldn't be on Kaggle otherwise).
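
The search itself is simple - something like the sketch below, assuming train, valid, and predictors are already set up as H2O frames (illustrative, not my exact code):

    configs <- list(c(10), c(10, 10), c(10, 10, 10), c(50, 50, 50))

    rmse <- sapply(configs, function(h) {
      m <- h2o.deeplearning(
        x = predictors, y = "Ca",
        training_frame = train, validation_frame = valid,
        hidden = h, epochs = 50
      )
      h2o.rmse(m, valid = TRUE)   # compare validation error; lower is better
    })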


You can do simple regressions (linear, logistic) natively and easily in R. The tutorial is demonstrating H2O by using an algorithm (deep learning) that's not native to R.
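
For comparison, the native route is literally a couple of lines (built-in data used just for illustration):

    linear_fit   <- lm(mpg ~ wt + hp, data = mtcars)                     # linear regression
    logistic_fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)  # logistic regression
    summary(linear_fit)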


One reason could be that it's a non-linear prediction problem.



