I'm deeply disappointed in Databricks as an RDBMS.
As a DS/DE, there's a lot to love (not all, but a lot). The easy provision of Spark clusters. The jobs API. DeltaLake (mostly). Easy notebooks (please don't create a prod system from these..). And Spark itself continues to improve, albeit in an increasingly crowded field.
But I've worked closely with BigCo SQL analysts on Azure Databricks, and their experience was terrible. For example:
- You cannot browse the data structure without an active cluster
- Starting a cluster can take ~5 minutes and, if you miss the moment it finally comes up, you may not submit your first query for 10-15 minutes.
- The SQL error messages are often (perhaps usually?) nonsense, so you have to operate without them.
- An unfortunate amount of downtime, followed by bizarre excuses.
- It's so darn slow, relative to equivalent queries on BigQuery or Snowflake.
- Even submitting a query can take a weird amount of time.
If Databricks-as-an-RDBMS were competing against Teradata, sure, let's have a chat.
But we're in 2021, and there's just no comparing the experience of the SQL analyst on Databricks-as-an-RDBMS vs. Snowflake/BigQuery.
I'm excited about the potential of Snowflake's SnowPark (though I know little about it). Calling UDFs from SQL means you can build great features for SQL analysts, provided they can build up the momentum to need them.
Teradata uses a shared-nothing architecture. Of course it will outperform the shared-disk model of Databricks and Snowflake: the data is colocated on the compute nodes, so it doesn't have to travel anywhere.
Good luck to your budget trying to scale a shared-nothing database up and down with the workload without downtime.
You can achieve significant speedup resembling shared-nothing databases by pushing the data close to the query using caching. Snowflake does it out of the box as it maintains table metadata. Databricks can do it too, but you have to be careful and it sucks.
Wait, what? I thought Spark was designed to work on top of a shared-nothing cluster? Do you have any resources describing the architecture of Databricks specifically?
If Teradata is faster for workload X or Y, it isn’t because of a shared nothing architecture. Databricks and Snowflake both cache the working set on local disks of the worker nodes, so in practice they are shared nothing systems from a performance perspective.
Do you have any resources or examples of challenges with scaling up a shared-nothing architecture? At the crossroads between a shared-nothing and shared disk decision at the moment.
With shared-disk, the data sits in cold storage (e.g. S3), which is cheap. Only when you have to do some big computation do you spin up a temporary compute cluster, do the computation, store the result back in S3 and kill the cluster. You can also have long-lived compute clusters and shut them down or scale them down at night when nobody uses them. Since the data doesn't "live" on the compute nodes, you can add or remove as many compute nodes as you wish.
In shared-nothing, the data "lives" on the compute nodes, so you can't willy-nilly add or remove nodes. If you removed more nodes than triple-replicated redundancy allows, the data would get lost; if you added nodes, you'd have to wait for the data to get rebalanced onto them, resulting in massive reshuffling of everything.
Keep in mind that shared-nothing clusters are most likely long-lived and have multiple large datasets sitting on them, so by adding nodes you will start to shuffle everything around.
With shared-disk, your compute cluster is single-use for one dataset, and you don't lose data as you scale the cluster up and down.
There is a hybrid architecture which would give you both, but it doesn't exist out there: shared-disk with active caching, i.e. giving you control over which tables get pinned on your temporary compute cluster. That would give you the performance of shared-nothing with the convenience of shared-disk temporary compute clusters.
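(A rough sketch of the shared-disk pattern described above, in PySpark; bucket and table names are made up, and the CACHE TABLE line is only a coarse stand-in for the "pinning" control the comment asks for, not an existing feature:)

    # Shared-disk pattern: data lives in S3, a temporary cluster does the work,
    # the result goes back to S3, the cluster dies. Names are hypothetical.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("ephemeral-job").getOrCreate()

    # Read straight from cold storage; nothing "lives" on the cluster.
    events = spark.read.parquet("s3a://my-lake/raw/events/")
    events.createOrReplaceTempView("events")

    # Coarse approximation of "pinning": cache the working set on the workers.
    spark.sql("CACHE TABLE events")

    daily = events.groupBy(F.to_date("event_ts").alias("day")).count()

    # Write the result back to cold storage; the cluster can then be torn down.
    daily.write.mode("overwrite").parquet("s3a://my-lake/marts/daily_event_counts/")
    spark.stop()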
They use DeltaLake + Spark 3.0, and are mostly careful to partition well.
Their datasets are small. Most tables are ~50GB, with the odd table up to ~2TB. The clusters are nothing shabby for this size, typically defaulting to ~4-12 nodes of 32GB each.
The queries that I have seen are typically not written well. Think view-on-view-on-view (there's a BigCo policy against them materialising data..), and where the filter is applied in the last step. The stuff of horrors, but something I've seen in more-than-one-BigCo.
But we have compared some of those same queries on BigQuery vs. Databricks, and I don't know whether BigQuery's execution optimiser is better, whether BigQuery's storage organises the data better, or whether BigQuery is simply throwing more resources their way.
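(For illustration, a hedged sketch of the anti-pattern being described, as Spark SQL driven from Python; the table and view names are invented, and explain() is just one way to check whether the late filter actually gets pushed down:)

    # Views stacked on views, with the filter only arriving in the last step.
    # `spark` is an existing SparkSession; sales/customers/regions are made up.
    spark.sql("""CREATE OR REPLACE TEMP VIEW v1 AS
                 SELECT * FROM sales JOIN customers USING (customer_id)""")
    spark.sql("""CREATE OR REPLACE TEMP VIEW v2 AS
                 SELECT *, amount * 1.2 AS amount_gross FROM v1""")
    spark.sql("""CREATE OR REPLACE TEMP VIEW v3 AS
                 SELECT * FROM v2 JOIN regions USING (region_id)""")

    # The filter shows up only here; a good optimiser pushes it down through
    # the whole view stack, a weaker one joins everything first.
    result = spark.sql("SELECT * FROM v3 WHERE sale_date >= '2021-01-01'")
    result.explain()  # check whether the date predicate reached the scan of `sales`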
> Think view-on-view-on-view (there's a BigCo policy against them materialising data..), and where the filter is applied in the last step. The stuff of horrors, but something I've seen in more-than-one-BigCo.
That sounds wonderful (really). I was contracting for a BigCo where they materialised things all the time, and they would regularly end up running queries over multiple materialisations from different points in time, which invariably means you get wrong answers.
I very much wished to put a stop to the use of any materialised views, but didn't have the buy-in to make it policy.
Was going to say. Most of the time, all it takes is a proper data model.
For analytics I favour de-normalized schemas and, if necessary, nested fields. Queries are much easier to write (fewer joins) and much faster; there's no need to incrementally materialize (sigh), fewer backfills, and no messy field definitions.
What you often see instead is highly-normalized data models with an untrackable number of materialized views (usually stacked on top of each other) and some complicated tools/solutions to try to deal with all that mess. The cost of a bad design.
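(A small, hypothetical sketch of what "de-normalized with nested fields" can look like in Spark: one orders table with an embedded array of line items, queried with no joins at all:)

    # `spark` is an existing SparkSession; the data is a toy example.
    from pyspark.sql import Row

    orders = spark.createDataFrame([
        Row(order_id=1, customer="acme",
            items=[Row(sku="A", qty=2, price=9.99), Row(sku="B", qty=1, price=4.50)]),
        Row(order_id=2, customer="globex",
            items=[Row(sku="A", qty=5, price=9.99)]),
    ])
    orders.createOrReplaceTempView("orders")

    # The nested `items` field replaces the orders/order_items join entirely.
    spark.sql("""
        SELECT order_id, customer, item.sku, item.qty * item.price AS line_total
        FROM orders
        LATERAL VIEW explode(items) i AS item
    """).show()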
In my experience Athena on AWS beats Spark by an order of magnitude in terms of performance and price. Without the annoyance of having to start/stop/run a cluster. Snowflake is even faster than that since its storage is more optimized.
And Athena is an old fork of Presto, provided as a service; modern Presto, e.g. Starburst, is much faster than Athena, and cheaper if you use it a lot, since Athena charges usage fees rather than hosting fees.
> You cannot browse the data structure without an active cluster
That's one of the reasons I'm interested in delta-rs [1], which has Delta Lake bindings for Python. I would love to read a Delta Lake table into a native Python object without the need for Spark.
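(A minimal sketch of that workflow with the deltalake package from the delta-rs project; the path is hypothetical and the exact API surface may differ between versions:)

    from deltalake import DeltaTable

    dt = DeltaTable("s3://my-lake/tables/events")  # or a local path
    print(dt.version())   # current table version
    print(dt.files())     # Parquet files backing that version

    # Read into pandas via Arrow -- no Spark cluster involved.
    df = dt.to_pyarrow_table().to_pandas()
    print(df.head())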
I see it as why the article supports Databricks as an RDBMS; it offers something others do not.
You can't currently* do the same extensive UDFs in Snowflake or BQ and, sometimes, they are important.
But with SnowPark coming, hopefully you won't have to make such a large sacrifice to SQL users' experience for it.
* Currently you can do JavaScript UDFs and external functions in Snowflake, and BigQuery ML is worth mentioning here too. Those cover some, but not all, of what you might use a Spark UDF for in SQL.
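(For illustration, a hedged sketch of the kind of Spark UDF the footnote refers to: a Python function registered once and then callable from plain SQL. The cleanup logic and table name are invented:)

    from pyspark.sql.types import StringType

    def normalise_phone(raw):
        """Toy cleanup logic that a team might wrap once and share via SQL."""
        if raw is None:
            return None
        digits = "".join(ch for ch in raw if ch.isdigit())
        return "+" + digits if digits else None

    spark.udf.register("normalise_phone", normalise_phone, StringType())

    # Analysts never touch Python; they just call the function from SQL.
    spark.sql("SELECT customer_id, normalise_phone(phone) AS phone FROM customers").show()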
They have thought about how they can improve the DS experience. Inconsistent storage? DeltaLake. Slow Spark queries? Databricks Delta. Model management? MLFlow (I haven't adopted this, but can't pin down why -- on face value it seems great). Development environment? Databricks Connect. Cluster management? Core.
But the same is not true for SQL analysts. Today's offering does not empathise with them. I'm not sure that integrating Redash is a genuine answer to their needs.
The upside here is that (1) Databricks (or at least Databricks' marketing) appears to be prioritising this need, and (2) a lot of people are betting a lot of money that they can do this well.
>Model management? MLFlow (I haven't adopted this, but can't pin down why -- on face value it seems great).
Probably because you want your code to be about the problem you're trying to solve, not about tracking experiments. Similar to anti-lock braking or electronic stability control in a car: you want them to be "on" by default, not to have to switch them on every five minutes while driving.
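(A minimal sketch of that "on by default" idea using MLflow's autologging; it assumes scikit-learn and a stock MLflow setup, and the dataset is just a stand-in:)

    # One autolog() call, and the rest of the code is about the model, not tracking.
    import mlflow
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split

    mlflow.autolog()  # params, metrics and the model get logged automatically

    X, y = load_diabetes(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    with mlflow.start_run():
        model = Ridge(alpha=0.5).fit(X_train, y_train)
        print("held-out R^2:", model.score(X_test, y_test))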
I would guess high cost, market saturation, and not enough research and development. It seemed like an OK product when we evaluated them, but the cost was prohibitive. After the Hortonworks and Cloudera merger, the future of the Hadoop/Spark open-source ecosystem seemed extremely dull, and we came really close to using either Databricks or one of their competitors. We found that Spark on Kubernetes appears to have a bright future, so we didn't have to double our Spark cluster costs by going to Databricks.
I have seen several deployments of Databricks (including first hand experience) and... most use cases could be better served by Postgres (or Redshift, Athena, Snowflake for larger scale). It has honestly been such an overkill for so many workloads, it was quite astonishing. I've seen people move from Excel to Spark... to handle the same volume of data. That's obviously not Databricks' fault, but their PR is pretty much "please do all your data work in our product, it's well suited for it".
Yes, it's very good if you don't like setting up clusters (few do/can) and the UI is rather useful for getting up and running (not so much for writing code though). But you need to really understand the platform before adopting it. Please, please don't just adopt it because it's popular.
These are great innovations, but can we please take a moment to realize that 99% of companies are still stuck with a blend of Excel, Access and SQL Server? Why is adoption of this new tech so poor? Maybe it has something to do with the amount of confusion that all the sales pitches about data lakes, big data, AI and co. are generating.
Because it's unnecessary for those 99% of companies. If company data fits in an Excel spreadsheet, an Access database, or a single MySQL/Postgres instance, then introducing all this new tech, with all of the associated costs and little gain in return, is a net loss.
Some of this new tech, like S3/Spectrum, is simpler to run than MySQL/Postgres. Excel is OK for last-mile analytics, but it doesn't work if you are looking for serious insights. Excel also generates a significant amount of unnecessary manual work and data quality issues. It's not so black and white; what I am saying is that the direction of the analytics tech industry doesn't seem to address the real issues.
Sure. It's possible to create something that is both good for small data and has a better feature set than Excel/Access/etc., but then you don't have the other main asset of those tools: popularity. It's relatively easy to find people skilled in these technologies, whereas new technologies need to come either with advocates capable of teaching others or with programmers already skilled in them. Otherwise, teaching people new technical skills is costly.
Because most data isn't "big" data. If it fits in Excel on a laptop, why bother to roll out a distributed data lake system like Spark with all the associated ops work.
... and if it doesn't fit in Excel on a laptop, the natural path is just some database running on a desktop, then a server, then a beefier server. It's actually amazing how far something like Postgres will scale for your data needs.
The ones I have been talking to stick to their Excel because their little part of the puzzle can be solved by Excel. These technologies usually demand that data be collected, but the incentive to do so is low in any classic balkanized F500 organization.
1. It is not necessary. For smaller datasets, Excel or SQL Server is sufficient.
2. The small dataset has grown over the years and another solution would be beneficial.
But people like the things they are familiar with. They do not want the additional work of learning new tools. Whoever wants to change the existing structures will usually face high resistance.
In the second case, companies often accept the issues of the old solution and tell themselves it is the right way.
I actually think the effect of all of this hype will move people past the Excel/Access era. There are simple analyses every company could do with R or scikit-learn that would save or make them money, and they just don't know how. Someone with AI expertise is over-qualified to do this, but they are at least qualified.
Because 99% of new tech is bullshit. It changed too much, no one can keep up. These old stacks are still around because they're proven to get the job done.
data lake, delta lake, lake house. snowflake, snowpark. cloud, data warehouse. I just want to take a second to thank these companies for trying to turn my profession (data science + analytics) into the living hell of your every day enterprise and tech culture cluster-fuck (pun not intended).
I'm still coming to terms with the fact that there's actually a technology called Kafka...
/sorry, I'm just particularly bitter after having to do some Databricks training yesterday... it was basically 80% rote repetition of the sales literature about why they're so great/enterprisey, and parroting company- and platform-specific jargon. I don't like seeing my profession turn into an obsession with tech and platforms when 99.9% of companies and people can't reason properly or operate their current tools/resources efficiently. Obviously just my own opinion.
I read a wonderful comment in another thread the other day. It was about the obsession of devs to collect a whole range of technologies on their resume. It went something like: "When I hire a carpenter I won't hire him for what he has in his toolbox. I want to know what he can do with only a hammer and chisel (= Linux machine). The rest he can learn."
I will try to look up the link and obviously 'he' can be 'she' as well.
Databricks is not an RDBMS...it's a Unified Analytics Platform (their words, I know), but that really does describe it well. It's been really liberating having easy access to versioned data at your fingertips when iterating on models...and also being able to track models and share work. It does have some warts, but it's much better than any other single platform I've used for data science/ML.
Hmm, I agree there's a market here, but I don't know why I wouldn't just use Snowflake or Bigquery if what I really wanted was a big-data RDBMS.
Everyone I know who uses Databricks (and they all like it) use it as hosted Spark with S3 integrations, or write... directly to Snowflake. I'm a little skeptical they're going to get traction as a true data lake model
Snowflake and Bigquery will bite you in the ass later on. You can do 80% of the things you will need - which is great for some newbie stuff or for sales presentations. Once you need something complicated, you're on your own, while being stuck in a proprietary environment that you cannot extend.
You will have to develop some kind of data lake to store unstructured data anyway. You will end up with a Snowflake data warehouse and a data lake. Why not just go with the data lake first, then?
Databricks/Spark are just good platforms to help you do something with structured data in your lake. With the recent additions to its execution engine and Delta (strange naming tbh) it will be pretty much the same as Snowflake for you.
BigQuery/Snowflake can process Parquet and multiple other formats in Object Storage. You can use them more "freely" if you keep your raw data in open formats.
You need something more complicated than what can be done using BigQuery/Snowflake (that remaining 20%, though I would say 10%)? Export the dataset to CSV/Parquet/Avro/ORC/whatever and process it with anything, including Dataproc/HDInsight/EMR or even Databricks. That's actually a common pattern.
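(A hedged sketch of that export-then-process-elsewhere pattern with the google-cloud-bigquery client; project, dataset and bucket names are made up:)

    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.ExtractJobConfig(
        destination_format=bigquery.DestinationFormat.PARQUET
    )

    # Dump the table to GCS as Parquet; Dataproc/EMR/Databricks can pick it up there.
    extract_job = client.extract_table(
        "my-project.analytics.events",
        "gs://my-export-bucket/events/*.parquet",
        job_config=job_config,
    )
    extract_job.result()  # wait for the export to finish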
While both BQ and Snowflake are adopting lake features, they still only support parquet and other file formats for loading, not for querying.
Can't do a simple
select * from s3://file.parquet
which you can do in Spark. Having to load it into the data warehouse means you store your data twice, and it is stupidly annoying.
Many times the data doesn't even resemble anything tabular before I structure it in Python scripts. Why would I load it into a warehouse only to then pull it down, do some Python processing, and load it back? That makes the data travel from DWH storage to a data lake, then to my compute cluster, and then the same cumbersome road trip back. Pretty wasteful. Spark at least allows me to schedule a Python function across a cluster while copying only from the lake to my compute node and back.
Data warehouses like BQ and Snowflake are great for data scientists after a bunch of engineers slice and dice raw data into clean tables. For anyone working with not yet structured data, data lake wins hands down.
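(A hedged sketch of both points, with invented bucket and column names: querying a Parquet file in S3 directly from Spark, and fanning an arbitrary Python function out over the cluster:)

    # `spark` is an existing SparkSession.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    # (1) the "select * from s3://file.parquet" equivalents in Spark:
    df = spark.read.parquet("s3a://my-lake/raw/file.parquet")
    spark.sql("SELECT * FROM parquet.`s3a://my-lake/raw/file.parquet` LIMIT 10").show()

    # (2) arbitrary Python run in parallel on the workers, lake-to-lake:
    def messy_to_clean(blob):
        # placeholder for the real restructuring logic
        return blob.strip().lower() if blob else None

    clean = F.udf(messy_to_clean, StringType())
    df.withColumn("clean_payload", clean("raw_payload")) \
      .write.mode("overwrite").parquet("s3a://my-lake/clean/file/")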
Not quite -- Snowflake allows you to query data directly from S3 without having to ingest it. You can either query the S3 file directly [1], or define an external table on top of it [2]. If the data is JSON, Avro, XML, or some other "semi-structured" format, then Snowflake's "variant" data type makes this type of data very easy to work with. For raw text or log lines, you can still load to Snowflake and process using regex or UDFs. For unstructured or "complex" data like images and PDFs, then you may need to reach for a different compute environment.
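(A hypothetical sketch of that VARIANT workflow via snowflake-connector-python; stage, table, column and credential values are all placeholders:)

    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="...",
        warehouse="ANALYTICS_WH", database="RAW", schema="EVENTS",
    )
    cur = conn.cursor()

    # Land semi-structured JSON into a single VARIANT column...
    cur.execute("CREATE TABLE IF NOT EXISTS raw_events (v VARIANT)")
    cur.execute("COPY INTO raw_events FROM @landing_stage/events/ FILE_FORMAT = (TYPE = 'JSON')")

    # ...then query nested fields directly with path syntax and casts.
    cur.execute("SELECT v:event_type::string, v:payload.user_id::string FROM raw_events LIMIT 10")
    for row in cur:
        print(row)
    conn.close()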
Yeah, I thought Snowflake external tables would do this, but it is not the case.
An external table in Snowflake only allows you to ingest data from S3 into their storage (which is also S3 behind the scenes).
Perhaps something has changed since the last time I tried, but my conclusion then was that "external tables" in Snowflake are not what you think they are.
Also, I have not seen examples of "select * from s3://file.json" in the links you provided.
> Can't do a simple select * from s3://file.parquet
That's not really true; Snowflake can query directly from S3 just fine, even from other clouds. You just need to set up the credentials (or supply them in the query, but that's not usually a good idea).
> While both BQ and Snowflake are adopting lake features, they still only support parquet and other file formats for loading, not for querying.
This is not true. Snowflake allows you to create an external stage (pointer to s3) and then query any prefix you want as long as you provide the correct file format arguments or types.
select $1, $2 from @somestage/someprefix (file_format => ....)
edit: hadn't refreshed the page and i can see others have already responded to this point.
In BQ you can query data as an external table with Hive partitioning. No need for duplication, and it's extremely useful for consuming "landing zones" in a data lake.
Pretty similar to defining a Hive table and then using any other engine to process it.
PS: BigQuery Omni (now beta) will support object storage solutions from other cloud providers.
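(A hedged sketch of that Hive-partitioned external table setup with the google-cloud-bigquery client; bucket, dataset and table names are invented, as is the dt partition column:)

    from google.cloud import bigquery

    client = bigquery.Client()

    external_config = bigquery.ExternalConfig("PARQUET")
    external_config.source_uris = ["gs://my-landing-zone/events/*"]

    hive_opts = bigquery.external_config.HivePartitioningOptions()
    hive_opts.mode = "AUTO"  # infer partition keys like dt=2021-01-01 from the paths
    hive_opts.source_uri_prefix = "gs://my-landing-zone/events/"
    external_config.hive_partitioning = hive_opts

    table = bigquery.Table("my-project.lake.events_external")
    table.external_data_configuration = external_config
    client.create_table(table, exists_ok=True)

    # Queries hit the files in GCS directly; partition filters prune what is read.
    job = client.query(
        "SELECT COUNT(*) FROM `my-project.lake.events_external` WHERE dt = '2021-01-01'"
    )
    print(list(job))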
>Once you need something complicated, you're on your own, while being stuck in a proprietary environment that you cannot extend.
Snowflake has a Spark connector too, so I don't know what the difference would be writing a Spark job against Delta Lake vs. Snowflake.
> 80% of the things you will need - which is great for some newbie stuff or for sales presentations.
This is obviously wrong.
> You will have to develop some kind of data lake to store unstructured data anyway. You will end up with a Snowflake data warehouse and a data lake. Why not just go with data lake first then.
We store unstructured data in snowflake. I don't understand why you need a datalake on top of it.
EXACTLY. You absolutely can store unstructured and semi-structured data in Snowflake. I find it baffling, and at this point a bit irritating, that there is this community of people insisting that it is not allowed for... some unspecified reason.
Why would I store e.g. a bunch of HTML files as string columns in Snowflake, only to pull them down, process them in Python and load them back into some other string table?
Because it actually costs the same, and if you process them in Snowflake using SQL or UDFs, you will get your results in seconds and you won't have to manage any of the underlying infrastructure.
>Once you need something complicated, you're on your own, while being stuck in a proprietary environment that you cannot extend.
Snowflake supports enough SQL constructs to allow for very complicated queries. If that doesn't suit your needs then there's stored procedures and custom javascript UDFs you can write. That covers probably 99+% of the use cases at most companies and usually the rest can be done somewhere else on pre-aggregated data.
1. Why tf would JavaScript be picked as the UDF language when Python dominates the data world?
2. People usually load complex Python libraries for data processing. I wonder if Snowflake UDFs would support that or just allow you to use the standard library.
1. why not? it's not like you're getting external libraries either way.
2. if you want complex libraries, you can use external functions (aws lambdas, etc) in Snowflake.
It's a lot simpler to use a single system as both your data lake, and your data warehouse. As Databricks gets better and better at the core data warehouse features, it becomes feasible to use it for both. Meanwhile, Snowflake and BQ are coming from the other direction, implementing data lake features. AWS strategy seems to be, just make it easier to have 2 systems and move data back and forth.
The reason Snowflake has such a high market cap is that its customers aren't paying much now, but they'll be paying and paying monthly forever. It's lock-in on a massive scale.
Delta lake is something you can run on data you feel you still have some semblance of control over.
I work for Snowflake and I'm always curious about this way of thinking. How is putting the data in Snowflake any different to any other RDBMS, or even S3, in this regard?
You don't own the underlying storage, true, but there's a defined method of getting data in and a defined method of getting that data out. The API is different to S3, sure, but it's an API all the same.
FWIW the only organisation I know that's ever truly locked in data was SAP, through their byzantine licensing and horrific API. Snowflake is no more locked in than anything else with a standard SQL interface.
Their Delta also has SSD caching, which turns out to be logic that stores a local copy of the file you queried, for faster re-querying. Going to start calling my LRU cache function that as well...
My company loves them. I think they only do a few things well, of which marketing is the best, and they are not worth the money for data science teams with devops skills. Happy to hear from others if I am wrong.
Depends. If you only have a couple workspaces, then no. It's worth the money for a company with lots of data science teams, and one team who is janitoring all the workspace infra. Our company has 20 workspaces for 10 teams already and it would be a nightmare if we expected everyone to fix and manage their own AWS stuff.
I’m one of the apparently odd few who doesn’t like it. It’s slow, overkill for many tasks, and a Swiss Army knife that’s ok at everything but not very good at anything. Debugging it can be a very frustrating experience. They’re trying to turn it into an everything platform so it checks every box in the procurement process and don’t care about actually making it good.
I have a feeling legacy databricks installations will be much derided in 5-10 years time
Were there some serious mental gymnastics coming from Fivetran yesterday, or am I missing something? The CEO had a data lake hit piece in Forbes ('The Business Case For Ditching Your Data Lake'[1]) and almost simultaneously authored this 'Databricks' Delta Lake is not a data lake' blog.
Felt like the CEO wanted to make a strong point, but has a relationship with Databricks that he didn't want to undermine, so did some reclassification gymnastics to make it all fit. But I am probably too cynical.
Finally, I beg for mercy from the 'lakehouse'. It is a bridge too far ;-)
The key gap as I see it is that Databricks doesn't support multi-table transactions. And that is why you can't treat it as though it were a drop-in functional replacement for an RDBMS.
(Why no vector clocks in the manifest files or something?)
No, Databricks is a distribution of Apache Spark with some value-added features such as the Delta Lake data format and a clean UI for hosting notebooks, doing cluster admin, etc