Polyaxon – An open source platform for reproducible machine learning at scale (polyaxon.com)
109 points by jonbaer on June 10, 2018 | 21 comments


It's great to see this sort of tool open-sourced. I am excited by new tools enabling better algo/ML engineering workflows.

In addition to the infra management, it quickly becomes tricky to support QA/visualization/tuning/debugging tools for very different sorts of algorithms, outputs, configurations, and metrics. Do you see your project going in those directions?


I can't say much about the long-term roadmap apart from the fact that the platform will be open source and will try to introduce features that increase the productivity of data scientists.

For the short-term roadmap, I am working on stability. It's very hard to choose default values since you don't know how the platform will be used, e.g. on Minikube or by a team scheduling a lot of parallel experiments, so what I am trying to do is at least provide an automatic or simple way to scale the workers responsible for scheduling, hyperparameter tuning, and monitoring.

For tuning, the platform will keep supporting algorithms to automate hyperparameter search, maybe introducing more priors for the Bayesian optimization. I also think more tests are needed to validate the behavior of the Bayesian optimization and Hyperband.

For visualization, you can currently start a TensorBoard for any project created on the platform, but there's a problem with this approach: if the project has a lot of experiments, TensorBoard becomes slow or unresponsive. The next release will introduce the possibility to create TensorBoard jobs per experiment or per hyperparameter-tuning experiment group, and possibly for any collection of experiments to compare them.

The platform already collects metrics from experiments, so a basic visualization is also planned, to give a quick overview before diving into a TensorBoard.

And most importantly, I think there are some usability issues that need to be solved to make the experience better.

There are also a couple of ideas around team collaboration that will be introduced in the mid term.


Thanks a lot for the answer! It's very promising.


Hi, I am the author of Polyaxon, a bit late to notice, but thanks for sharing Polyaxon here. I will be around to answer questions, and any feedback is welcome.


Thanks a bunch for taking the effort to create this, and for making it open source! The project looks amazing.

Could you maybe also explain what the target audience is? Are there any benefits to using Polyaxon in (solo) research projects on a cluster, or is it tailored towards production-ready environments at corporations?


I think the target audience is individuals or small teams who want an organized workflow and immutable, reproducible experiments, with an easy way to access logs and outputs.

The platform also provides a lot of automation to schedule concurrent experiments.

There are a couple of things that still need to be polished to be really usable, e.g. notes on experiments and notifications for finished experiments, especially if you are running hundreds of experiments.

Depending on how organized you are, many times you will end up with experiments that you no longer know how you started; having a platform that takes care of that could be beneficial.

If you already have a cluster for running your experiments, you will most probably end up ssh-ing into the machines to check which experiments finished (probably in a screen session) and to look at their results and logs. Polyaxon simplifies that part as well.


These are all the reasons why I want to use your tool.

Thanks for creating it and I hope one day soon to contribute to your project.


Can you talk about how your product is different from Databricks MLFlow?


Can someone tell me how this is different from/improves over pachyderm?


Pachyderm is a system for the nouns; this is a system for the verbs.

Polyaxon makes it easy to schedule training on a Kubernetes cluster. The problem this solves is that machine learning engineers generally spend too long running their jobs in series, rather than in parallel. Instead of running one thing and waiting for it to finish, it's both more efficient and better methodology to plan out the experiments and then run them all at once.
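
To make the contrast concrete, here's a rough sketch of the "plan the grid up front, submit it all at once" idea (illustrative only, not Polyaxon's actual API; submit_job is just a stand-in for whatever scheduler you use):

    from itertools import product

    learning_rates = [0.1, 0.01, 0.001]
    batch_sizes = [32, 64, 128]

    def submit_job(params):
        # Stand-in for whatever scheduler you use (Polyaxon, SLURM, a
        # Kubernetes Job, ...); a real submit returns immediately, so the
        # nine runs execute concurrently on the cluster.
        print("scheduling run with", params)

    for lr, bs in product(learning_rates, batch_sizes):
        submit_job({"learning_rate": lr, "batch_size": bs})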

Pachyderm is more concerned with versioning and asset management. It's more like Git+Airflow.

Let's say your experiment depends on training word vectors from Common Crawl dumps. You need to download the dump, extract the text you want, and train your word vector models. Pachyderm is all about the problem of caching the intermediate results of that ETL pipeline, and making sure that you don't lose track of, say, which month of data was used to compute these vectors. Polyaxon is all about the problem that there are so many ways to train the word vectors and use them downstream. You want to explore that space systematically, by scheduling and automatically evaluating a lot of the work in parallel.


I just want to add to the other comments that Polyaxon focuses on different aspects of machine learning reproducibility than Pachyderm, although Polyaxon will be providing a very simple pipelining abstraction to start experiments based on previous jobs, triggers, or schedules, and the possibility to run post-experiment jobs. It will not focus on data provenance the same way Pachyderm does. In fact, Polyaxon and Pachyderm could be used together.


I don't know Pachyderm, but it seems to me quite similar to Storm for creating data pipelines. Polyaxon is useful for training deep learning models on a cluster. I couldn't find any example of how to do that in Pachyderm (there are only single-node examples).


For reproducible ML I recommend Neptune - Machine Learning Lab, https://neptune.ml/ (disclaimer: I work with the people who created it).

Not only does it allow you to run/enqueue things in the cloud, it also does a very good job of tracking source code (with code snapshots and git integration), parameters, and output statistics (e.g. you can select all models with the #lstm tag and sort by log-loss on the validation dataset).


Great to see infrastructure like this come along. I'm wondering what everyone else is using...


I've been starting to use this in spaCy, so I'm glad to see it posted here! It's still a young project, but Mourad has been very dedicated to it, and I think it's already at the point where it's useful. I hope more people can contribute. Here's a quick review.

Most people doing machine learning at the moment are using a pretty bad workflow. It's difficult to avoid the trap people refer to as "grad student descent": endless tinkering, where you run two or three jobs, monitor the results, and then kick off another one. You don't really have a hypothesis in this cycle, so you don't know when to stop. At the end of the process you've generally gained intuition and insight, but nothing you can reliably pass on.

The solution to this trap is to commit to a matrix of results you're going to collect, program up the experiments, and let them run. Once you have the proper comparison, you can then decide what to do next.
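
Something like this, as a toy illustration (load_metric is a hypothetical stand-in for reading a finished run's validation score; the point is filling in the whole matrix before drawing conclusions):

    from itertools import product
    import random

    grid = list(product([0.1, 0.01, 0.001], ["adam", "sgd"]))

    def load_metric(lr, optimizer):
        # Hypothetical: read the finished run's validation score from disk
        # or a tracking server. A random placeholder keeps the sketch runnable.
        return random.random()

    results = {(lr, opt): load_metric(lr, opt) for lr, opt in grid}
    best_config = max(results, key=results.get)
    print("best configuration:", best_config)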

Most university research groups get a grant to buy some hardware, create a cluster, and then schedule jobs on the machines using SLURM or HTCondor. These technologies are mature, but they leave individual researchers with a lot to do. You can schedule your jobs, but you have to write the experiment management yourself. My hunch is there are maybe 5-50 companies in the world with internal systems significantly more sophisticated than this.

Polyaxon brings the shiny new "cloud native" workflow to the problem. It runs on Kubernetes, which is much easier to use, especially with heterogeneous hardware. I think just switching to Kubernetes and containers would be helpful to a lot of teams. On top of the cluster solution, Polyaxon brings a nice experiment management layer, with hyper-parameter search. It also manages the containerisation for you, so that the researcher doesn't need to interact with Docker directly.

There are still a number of things that are under-developed. The most noticeable are dataset management and artifact export. You currently have to do this yourself, e.g. by adding persistent disks to the cluster. I use GCSFuse to mount GCS (Google's S3 equivalent) buckets as directories, which works pretty well in the meantime. There are also a few defaults that could use refinement. If large clusters are being created, the management services are currently a bit under-resourced. Finally, there are a few more minor rough spots. For instance, the web app is currently a little unpolished. It's a Django app, so everything takes two or three more clicks and refreshes than you'd ideally want. A more AJAXy front-end would be nice.

There are several commercial competitors. There's a pretty obvious analogy between the use-case here and the CRM. Companies hope to own the "system of record", and be the shared space where ML teams collaborate. DominoDataLab.com, Neptune.ml, Cloudera.com, Datascience.com, and others all have slightly different takes on this problem, to different degrees. Many of the above are built around a Jupyter Notebooks-based experience, and are targeted more towards workflows where the primary outputs are insights and reports rather than product development.

I think an open-source framework is valuable for a few reasons. We should be reluctant to buy in to commercial platforms for precisely the reasons vendors are so interested in this space. Lock-in can hurt here, and if the computation is scheduled via the vendor, they get the chance to tax you a % of your compute spending. Given how expensive GPU experiments can be, that's a big ongoing cost to sign up for.

I also think uploading all your training data to someone else's system is a bad idea, that your data sharing agreements often won't permit.

Finally, it's nice to have a local Kubernetes cluster for other reasons. Kubernetes is basically an OS. Polyaxon is an app that runs on that OS. This is nice: you can develop other apps to work with it as well. In contrast, if you rent a service, the easiest way to meet your next requirements will be with further services. As soon as you hit custom requirements, costs and risks rise rapidly. The in-house approach signs you on to a better future --- it's a step in the right direction. The commercial services may or may not be easier today, but at some point you'll want to switch out of them. It may still be worth using them today --- but the long-term perspective is at least a point in Polyaxon's favour.


Thank you for the great feedback. A lot of things are indeed planned to enhance the dashboard as well as the CLI in terms of searching, filtering, and ordering experiments based on some rule, e.g. parameters or metrics.

Also, some of your feedback, and other users' feedback, was very helpful in bringing changes to the infrastructure for the next release, to maximize the usage of the cluster's resources.


It's only reproducible if the computation is deterministic. This is often not the case for machine learning, especially TensorFlow on GPU. How do you deal with that?


As you said, if the library/framework provides deterministic computation, then it should not be a problem. I am not sure if it's already fixed in TensorFlow, but parallel computation and the order of operations can influence the reproducibility of a run, so sometimes you need to handle that in your code.
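
For what it's worth, the usual first step is seeding every source of randomness yourself (the sketch below uses the TensorFlow 1.x API, which is current at the time of writing); even then, some GPU ops remain non-deterministic, so results may not be bit-identical:

    import random
    import numpy as np
    import tensorflow as tf

    SEED = 42
    random.seed(SEED)
    np.random.seed(SEED)
    tf.set_random_seed(SEED)  # TF 1.x; in TF 2.x this is tf.random.set_seed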

What Polyaxon provides is a way to restart a run based on as many of the collected parameters as possible: by default it will use the same code based on the internal git commit, it will reuse the same configuration and the same Dockerfile, and, if provided, it will use the same resources (CPU, GPU, and memory) as the original run. If the experiment had a node selector, the restarted experiment will also be scheduled on the same node.


No support for (Py)Torch?


It supports PyTorch.

It also supports anything --- you give it a script and it builds a Docker container and runs it. You don't need to use any particular language or framework.


As @syllogism said, you can use any framework/library to train your models.

A couple of libraries have special integration, TensorFlow, MXNet, Horovod, and PyTorch; they have a section in the specification for users who want to run distributed experiments.



