Also, there's a level of automation for model tuning (experiments), especially if this is an update to an existing model (you still need to do model validation). For the initial model there will be automation too, but a lot of time will be spent validating the model against business KPIs, not just some model metric like accuracy. Databricks makes this automation and tracking pretty simple, and you don't have to do it in notebooks. Once you're done with exploration, build out the pipeline in your "framework" and kick off jobs to execute it on Databricks. And yes, you can use ETL tools here too, to, I dunno, retrain every x hours or whatever.
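To make that concrete, kicking off an existing training job from outside a notebook is basically one call against the Databricks Jobs 2.1 REST API. A rough sketch (the job ID and parameters are made up, and this assumes a workspace host and token in the environment):

    # Rough sketch: trigger an existing Databricks job from CI, cron, or your laptop.
    import os
    import requests

    host = os.environ["DATABRICKS_HOST"]
    token = os.environ["DATABRICKS_TOKEN"]

    resp = requests.post(
        f"{host}/api/2.1/jobs/run-now",
        headers={"Authorization": f"Bearer {token}"},
        # Hypothetical job ID and parameters for a training pipeline job.
        json={"job_id": 123456, "notebook_params": {"retrain_window_days": "7"}},
    )
    resp.raise_for_status()
    print("Triggered run", resp.json()["run_id"])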
Each round trip of model training (such as when optimizing hyperparameters) should be a distinct run through the ETL stages, with dedicated artifact tracking. Observability and monitoring for ML model training ETLs should always include whatever end criteria the business uses to judge the model, such as an ETL step for real acceptance testing, not just domain-specific accuracy metrics like F1 scores or ROC curves.
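As a rough sketch of what that looks like in practice (MLflow here; the `train_and_evaluate` stub and the KPI threshold are placeholders for whatever your pipeline and business actually use):

    # Sketch: every hyperparameter round trip is its own tracked run, and each run
    # records the business acceptance criterion alongside the usual model metrics.
    import mlflow

    ACCEPTANCE_THRESHOLD = 0.02  # e.g. minimum uplift the business requires (placeholder)

    def train_and_evaluate(params):
        """Stand-in for the real training + evaluation stages of the pipeline."""
        return None, {"f1": 0.0, "uplift": 0.0}

    with mlflow.start_run(run_name="weekly_retrain"):
        for params in ({"max_depth": 4}, {"max_depth": 8}, {"max_depth": 12}):
            with mlflow.start_run(run_name=f"depth_{params['max_depth']}", nested=True):
                mlflow.log_params(params)
                model, metrics = train_and_evaluate(params)
                mlflow.log_metric("f1", metrics["f1"])                         # model-centric metric
                mlflow.log_metric("est_conversion_uplift", metrics["uplift"])  # business KPI
                # Explicit acceptance step, not just an accuracy readout.
                passed = metrics["uplift"] >= ACCEPTANCE_THRESHOLD
                mlflow.set_tag("business_acceptance", "pass" if passed else "fail")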
I agree, but you're looking at everything through an automated ETL workflow, which sort of invokes the idea of using a tool like Airflow. Generally you don't need that. What you need is a specific set of functionality (training, tuning, testing) that you can either kick off manually when you want to work interactively or hook into a pipeline when you want to automate things. And yes, you can track everything using MLflow on Databricks. Matter of fact, Delta Lake is one of the most powerful features, since you track not just migrations but data changes, so you can track end-to-end lineage; so far it's the ONLY platform (that I've tested) that allows this with ease.
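For example, pinning the exact Delta table version a model was trained on and recording it with the run is only a couple of lines (the table path and version below are placeholders):

    # Sketch: read a pinned Delta table version and record it with the MLflow run,
    # so the model is tied to a reproducible data snapshot.
    import mlflow
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    data_version = 42  # e.g. the table version current when the run was kicked off
    features = (
        spark.read.format("delta")
        .option("versionAsOf", data_version)    # Delta time travel
        .load("/mnt/lake/features/customers")   # hypothetical table path
    )

    with mlflow.start_run():
        mlflow.log_param("feature_table", "/mnt/lake/features/customers")
        mlflow.log_param("feature_table_version", data_version)
        # ...train on `features` and log the model under the same run...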
I think this is wrong. This is what ML researchers tend to think when they are ignorant of why DevOps best practices exist and what other teams need in order to provide underlying infrastructure support. You’re only thinking of “what you need” from the point of view of the developer experience of the ML engineer, which frankly is usually the least important part by a wide margin.
Can't disagree with you more here. Databricks is a tool; it's on you to create your frameworks or whatever to optimize your workflow. I've been using Databricks and it's a great platform, at least compared to what I've used in the past (EMR/SageMaker). Simply build a framework where you offload the notebooks and exploratory stuff to Databricks, build out production-ready pipelines locally using sample data from Databricks, then test on Databricks. That's what we've been doing and it's working very nicely. Furthermore, you can remotely kick off jobs on Databricks, so once you have your pipeline worked out you really don't need to interact with notebooks if you don't want to...just use it as a scalable backend, dashboard, model tracker, whatever fits in your workflow.
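For example, once the pipeline is packaged as plain Python, registering it as a Databricks job (no notebook anywhere) is just a Jobs API call. A sketch only; the paths, cluster spec, and names are illustrative, not our actual setup:

    # Sketch: register a locally developed pipeline as a Databricks job that runs a
    # plain Python entry point instead of a notebook.
    import os
    import requests

    host = os.environ["DATABRICKS_HOST"]
    token = os.environ["DATABRICKS_TOKEN"]

    job_spec = {
        "name": "churn-model-training",                               # hypothetical job name
        "tasks": [
            {
                "task_key": "train",
                "spark_python_task": {
                    "python_file": "dbfs:/pipelines/churn/train.py",  # packaged pipeline entry point
                    "parameters": ["--env", "prod"],
                },
                "new_cluster": {
                    "spark_version": "13.3.x-scala2.12",
                    "node_type_id": "i3.xlarge",
                    "num_workers": 4,
                },
            }
        ],
    }

    resp = requests.post(
        f"{host}/api/2.1/jobs/create",
        headers={"Authorization": f"Bearer {token}"},
        json=job_spec,
    )
    resp.raise_for_status()
    print("Created job", resp.json()["job_id"])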
I feel like instead of actually using the tool, a lot of folks come in with specific expectations, and if the tool doesn't fit exactly into how they work then it just gets written off.
The trouble is that those expectations exist for a reason, and there’s a whole world of best practices and efficient patterns that have developed specifically to solve the challenges of orchestrating and monitoring ETLs. What feels convenient for researchers using a notebook is just fool’s gold (for known, legit reasons), and it damages credibility with other domains of engineering when their widely vetted best practices are ignored just for the sake of convenience tools like notebooks or manual job kickoff interfaces through a vendor tool like Databricks.
They (Databricks) are not advertising themselves as a best-practices way of doing anything; they're just a platform for doing data science and analytics. It's up to the data scientist to use the tool properly and to have proper methodologies. A platform is simply a set of tools that helps a target audience, and that's what Databricks is. Now, the issue here is that a lot of data scientists either can't code/do engineering or just don't want to. But in my view people like that are just glorified analysts. Data science is built on a foundation of software engineering (plus stats, viz, math, etc.); this is why it's so complicated. If you can't code or don't understand the SDLC or best practices, you're basically a carpenter who can't hammer or saw.
The question is not whether Databricks claims to be anything. The question is simply what counts as a best practice, from an ETL/DevOps point of view, and how to get ML tooling to adhere to it from first principles.
I’ll give a concrete example. In my org we use Dataproc on GCP as a model training task execution paradigm. You define your base environment via some Docker container, put it in GCR, and then define Dataproc jobs in terms of the base environment, the backing compute resources, any GCS bucket connections, and any ML-specific config like hyperparameters.
A human being never under any circumstances triggers these jobs. Instead a human user deploys the config as a cronjob or regular job in Kubernetes, and then a scheduler picks them up and runs them. For experimental workloads only, developers can manually trigger a Kubernetes job.
Each job consults config, spins up the appropriate Dataproc cluster, runs the job (with visualization tools exposed on ports at the cluster node IPs), and saves artifacts to GCS when done.
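The core of what each of these jobs executes can be pretty small. Roughly something like this sketch (the project, region, cluster, bucket paths, and hyperparameter args are all placeholders, not our actual setup):

    # Sketch of what a scheduled Kubernetes job might run: submit a PySpark training
    # job to a Dataproc cluster and let the training code write artifacts to GCS.
    from google.cloud import dataproc_v1

    PROJECT = "my-ml-project"       # placeholder
    REGION = "us-central1"
    CLUSTER = "training-cluster"    # provisioned per the job's compute config

    client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
    )

    job = {
        "placement": {"cluster_name": CLUSTER},
        "pyspark_job": {
            "main_python_file_uri": "gs://my-ml-bucket/jobs/train.py",  # baked from the reviewed repo
            "args": ["--learning-rate", "0.01", "--output", "gs://my-ml-bucket/artifacts/run-001/"],
        },
    }

    result = client.submit_job(request={"project_id": PROJECT, "region": REGION, "job": job})
    print("Submitted Dataproc job", result.reference.job_id)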
All of this is controlled via clean and easy internal CLI tools and wrappers to make it simple for any developer.
The number one thing this ensures is that no work ever exists in notebook format, beyond tiny scratch work a developer might do strictly to debug code or try a small data proof of concept.
The number two thing this ensures is complete reproducibility. Since every possible training task must go through code review, commit all config to version control, get impounded into a container, and execute via a deployed Kubernetes job, it is by definition impossible for someone to execute an ad hoc task that other engineers can’t rerun or have to follow weird setup steps to recreate (it’s all impounded in the container).
The third thing it ensures is that all accuracy, monitoring and results artifacts are explicitly tied to the Kubernetes job that controlled the process. It is not possible for some accuracy result to float around untethered from a job ID that uniquely and conclusively ties it to all relevant code for the job. This can be facilitated through MLflow or whatever else.
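Concretely, the training code just tags its tracking run with the identity of the job that launched it. A sketch with MLflow (the environment variable names here are whatever your job template injects, not anything standard):

    # Sketch: tie every metric and artifact to the Kubernetes job / commit / image
    # that produced it, so nothing floats around untethered.
    import os
    import mlflow

    with mlflow.start_run():
        mlflow.set_tag("k8s_job_name", os.environ.get("K8S_JOB_NAME", "unknown"))
        mlflow.set_tag("git_commit", os.environ.get("GIT_COMMIT", "unknown"))
        mlflow.set_tag("container_image", os.environ.get("CONTAINER_IMAGE", "unknown"))
        # ...training, evaluation, and metric logging all happen under this run...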
Getting data scientists to "wear corrective shoes" and reorient their way of working to align with this process has universally paid dividends, both for letting the data scientists experiment faster and more reliably, and for ensuring model training adheres to SRE-related compliance and best practices. That makes it pluggable into the various tools and constraints those non-ML support teams need in order to do their jobs and support ML teams without getting hit with unstructured notebook spaghetti and bespoke execution paradigms.
Databricks is not an RDBMS...it's a Unified Analytics Platform (their words, I know), but that really does describe it well. It's been really liberating to have easy access to versioned data at your fingertips when iterating on models, and also to be able to track models and share work. It does have some warts, but it's much better than any other single platform I've used for data science/ML.