I'm deeply disappointed in Databricks as an RDBMS.
As a DS/DE, there's a lot to love (not all, but a lot). The easy provision of Spark clusters. The jobs API. DeltaLake (mostly). Easy notebooks (please don't create a prod system from these..). And Spark itself continues to improve, albeit in an increasingly crowded field.
But I've worked closely with BigCo SQL analysts on Azure Databricks, and their experience was terrible. For example:
- You cannot browse the data structure without an active cluster
- Starting a cluster can take ~5 minutes and, if you miss the moment it finally comes up, you may not submit your first query for 10-15 minutes.
- The SQL error messages are often (perhaps usually?) nonsense, so you have to operate without them.
- An unfortunate amount of downtime, followed by bizarre excuses.
- It's so darn slow, relative to equivalent queries on BigQuery or Snowflake.
- Even submitting a query can take a weird amount of time.
If Databricks-as-an-RDBMS were competing against Teradata, sure, let's have a chat.
But we're in 2021, and there's just no comparing the experience of the SQL analyst on Databricks-as-an-RDBMS vs. Snowflake/BigQuery.
I'm excited about the potential of Snowflake's SnowPark (though I know little about it). Calling UDFs from SQL means you can build great features for SQL analysts, provided they can build up the momentum to need them.
Teradata uses a shared-nothing architecture. Of course it will outperform the shared-disk model of Databricks and Snowflake: the data is colocated on the compute nodes, so it doesn't have to travel anywhere.
Good luck to your budget trying to scale a shared-nothing database up and down with the workload without downtime.
You can achieve significant speedup resembling shared-nothing databases by pushing the data close to the query using caching. Snowflake does it out of the box as it maintains table metadata. Databricks can do it too, but you have to be careful and it sucks.
Wait, what? I thought Spark was designed to work on top of a shared-nothing cluster? Do you have any resources describing the architecture of Databricks specifically?
If Teradata is faster for workload X or Y, it isn’t because of a shared nothing architecture. Databricks and Snowflake both cache the working set on local disks of the worker nodes, so in practice they are shared nothing systems from a performance perspective.
Do you have any resources or examples of challenges with scaling up a shared-nothing architecture? At the crossroads between a shared-nothing and shared disk decision at the moment.
With shared-disk, the data sits in cold storage (e.g. S3), which is cheap. Only when you have to do some big computation do you spin up a temporary compute cluster, do the computation, store the result back in S3 and kill the cluster. You can also have long-lived compute clusters and shut them down or scale them down at night when nobody uses them. Since the data doesn't "live" on the compute nodes, you can add or remove as many compute nodes as you wish.
In shared-nothing, the data "lives" on the compute nodes, so you can't willy-nilly add or remove nodes. If you removed more nodes than triple-replicated redundancy allows, the data would get lost; if you added nodes, you'd have to wait for the data to get rebalanced onto them, resulting in massive reshuffling of everything.
Keep in mind that shared-nothing clusters are most likely long-lived and have multiple large datasets sitting on them, so by adding nodes you will start to shuffle everything around.
With shared-disk, your compute cluster is single-use for one dataset, and you don't lose data as you scale the cluster up and down.
There is a hybrid architecture which would give you both, but it doesn't exist out there: shared-disk with active caching, i.e. giving you control over which tables get pinned on your temporary compute cluster. That would give you the performance of shared-nothing with the convenience of shared-disk temporary compute clusters.
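(A rough sketch of the shared-disk pattern described above, in PySpark; bucket and table names are made up, and the CACHE TABLE line is only a coarse stand-in for the "pinning" control the comment asks for, not an existing feature:)

    # Shared-disk pattern: data lives in S3, a temporary cluster does the work,
    # the result goes back to S3, the cluster dies. Names are hypothetical.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("ephemeral-job").getOrCreate()

    # Read straight from cold storage; nothing "lives" on the cluster.
    events = spark.read.parquet("s3a://my-lake/raw/events/")
    events.createOrReplaceTempView("events")

    # Coarse approximation of "pinning": cache the working set on the workers.
    spark.sql("CACHE TABLE events")

    daily = events.groupBy(F.to_date("event_ts").alias("day")).count()

    # Write the result back to cold storage; the cluster can then be torn down.
    daily.write.mode("overwrite").parquet("s3a://my-lake/marts/daily_event_counts/")
    spark.stop()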
They use DeltaLake + Spark 3.0, and are mostly careful to partition well.
Their datasets are small. Most tables are ~50GB, with the odd table up to ~2TB. The clusters are nothing shabby for this size, typically defaulting to ~4-12 nodes of 32GB each.
The queries that I have seen are typically not written well. Think view-on-view-on-view (there's a BigCo policy against them materialising data..), and where the filter is applied in the last step. The stuff of horrors, but something I've seen in more-than-one-BigCo.
But we have compared some of those same queries on BigQuery vs. Databricks, and I don't know whether BigQuery's execution optimiser is better, whether BigQuery's storage organises the data better, or whether BigQuery is simply throwing more resources their way.
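(For illustration, a hedged sketch of the anti-pattern being described, as Spark SQL driven from Python; the table and view names are invented, and explain() is just one way to check whether the late filter actually gets pushed down:)

    # Views stacked on views, with the filter only arriving in the last step.
    # `spark` is an existing SparkSession; sales/customers/regions are made up.
    spark.sql("""CREATE OR REPLACE TEMP VIEW v1 AS
                 SELECT * FROM sales JOIN customers USING (customer_id)""")
    spark.sql("""CREATE OR REPLACE TEMP VIEW v2 AS
                 SELECT *, amount * 1.2 AS amount_gross FROM v1""")
    spark.sql("""CREATE OR REPLACE TEMP VIEW v3 AS
                 SELECT * FROM v2 JOIN regions USING (region_id)""")

    # The filter shows up only here; a good optimiser pushes it down through
    # the whole view stack, a weaker one joins everything first.
    result = spark.sql("SELECT * FROM v3 WHERE sale_date >= '2021-01-01'")
    result.explain()  # check whether the date predicate reached the scan of `sales`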
> Think view-on-view-on-view (there's a BigCo policy against them materialising data..), and where the filter is applied in the last step. The stuff of horrors, but something I've seen in more-than-one-BigCo.
That sounds wonderful (really). I was contracting for a BigCo where they materialised things all the time, and they would regularly end up running queries over multiple materialisations from different points in time, which invariably means you get wrong answers.
I very much wished to put a stop to the use of any materialised views, but didn't have the buy-in to make it policy.
Was going to say. Most of the time, all it takes is a proper data model.
For analytics I favour de-normalized schemas and, if necessary, nested fields. Queries are much easier to write (fewer joins) and much faster; there's no need to incrementally materialize (sigh), fewer backfills, and no messy field definitions.
What you often see instead is highly-normalized data models with an untrackable number of materialized views (usually stacked on top of each other) and some complicated tools/solutions to try to deal with all that mess. The cost of a bad design.
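(A small, hypothetical sketch of what "de-normalized with nested fields" can look like in Spark: one orders table with an embedded array of line items, queried with no joins at all:)

    # `spark` is an existing SparkSession; the data is a toy example.
    from pyspark.sql import Row

    orders = spark.createDataFrame([
        Row(order_id=1, customer="acme",
            items=[Row(sku="A", qty=2, price=9.99), Row(sku="B", qty=1, price=4.50)]),
        Row(order_id=2, customer="globex",
            items=[Row(sku="A", qty=5, price=9.99)]),
    ])
    orders.createOrReplaceTempView("orders")

    # The nested `items` field replaces the orders/order_items join entirely.
    spark.sql("""
        SELECT order_id, customer, item.sku, item.qty * item.price AS line_total
        FROM orders
        LATERAL VIEW explode(items) i AS item
    """).show()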
In my experience Athena on AWS beats Spark by an order of magnitude in terms of performance and price. Without the annoyance of having to start/stop/run a cluster. Snowflake is even faster than that since its storage is more optimized.
And Athena is an old fork of Presto, provided as a service; modern Presto, e.g. Starburst, is much faster than Athena, and cheaper if you use it a lot, since Athena charges usage fees rather than hosting fees.
> You cannot browse the data structure without an active cluster
That's one of the reasons I'm interested in delta-rs [1], which has Delta Lake bindings for Python. I would love to read a Delta Lake table into a native Python object without the need for Spark.
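(A minimal sketch of that workflow with the deltalake package from the delta-rs project; the path is hypothetical and the exact API surface may differ between versions:)

    from deltalake import DeltaTable

    dt = DeltaTable("s3://my-lake/tables/events")  # or a local path
    print(dt.version())   # current table version
    print(dt.files())     # Parquet files backing that version

    # Read into pandas via Arrow -- no Spark cluster involved.
    df = dt.to_pyarrow_table().to_pandas()
    print(df.head())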
I see it as why the article supports Databricks as an RDBMS; it offers something others do not.
You can't currently* do the same extensive UDFs in Snowflake or BQ and, sometimes, they are important.
But with SnowPark coming, hopefully you won't have to make such a large sacrifice to SQL users' experience for it.
* Currently you can do JavaScript UDFs and external functions in Snowflake, and BigQuery ML is worth mentioning here too. Those cover some, but not all, of what you might use a Spark UDF for in SQL.
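(For illustration, a hedged sketch of the kind of Spark UDF the footnote refers to: a Python function registered once and then callable from plain SQL. The cleanup logic and table name are invented:)

    from pyspark.sql.types import StringType

    def normalise_phone(raw):
        """Toy cleanup logic that a team might wrap once and share via SQL."""
        if raw is None:
            return None
        digits = "".join(ch for ch in raw if ch.isdigit())
        return "+" + digits if digits else None

    spark.udf.register("normalise_phone", normalise_phone, StringType())

    # Analysts never touch Python; they just call the function from SQL.
    spark.sql("SELECT customer_id, normalise_phone(phone) AS phone FROM customers").show()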
They have thought about how they can improve the DS experience. Inconsistent storage? DeltaLake. Slow Spark queries? Databricks Delta. Model management? MLFlow (I haven't adopted this, but can't pin down why -- on face value it seems great). Development environment? Databricks Connect. Cluster management? Core.
But the same is not true for SQL analysts. Today's offering does not empathise with them. I'm not sure that integrating Redash is a genuine answer to their needs.
The upside here is that (1) Databricks (or at least Databricks' marketing) appears to be prioritising this need, and (2) a lot of people are betting a lot of money that they can do this well.
>Model management? MLFlow (I haven't adopted this, but can't pin down why -- on face value it seems great).
Probably because you want your code to be about the problem you're trying to solve, not about tracking experiments. Similar to anti-lock braking or electronic stability control in a car: you want them to be "on" by default, not to have to switch them on every five minutes while driving.
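(A minimal sketch of that "on by default" idea using MLflow's autologging; it assumes scikit-learn and a stock MLflow setup, and the dataset is just a stand-in:)

    # One autolog() call, and the rest of the code is about the model, not tracking.
    import mlflow
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split

    mlflow.autolog()  # params, metrics and the model get logged automatically

    X, y = load_diabetes(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    with mlflow.start_run():
        model = Ridge(alpha=0.5).fit(X_train, y_train)
        print("held-out R^2:", model.score(X_test, y_test))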
I would guess high cost, market saturation, and not enough research and development. It seemed like an OK product when we evaluated them, but the cost was prohibitive. After the Hortonworks and Cloudera merger, the future of the Hadoop/Spark open-source ecosystem seemed extremely dull, and we came really close to using either Databricks or one of their competitors. We found that Spark on Kubernetes appears to have a bright future, so we didn't have to double our Spark cluster costs by going to Databricks.
I have seen several deployments of Databricks (including first hand experience) and... most use cases could be better served by Postgres (or Redshift, Athena, Snowflake for larger scale). It has honestly been such an overkill for so many workloads, it was quite astonishing. I've seen people move from Excel to Spark... to handle the same volume of data. That's obviously not Databricks' fault, but their PR is pretty much "please do all your data work in our product, it's well suited for it".
Yes, it's very good if you don't like setting up clusters (few do/can) and the UI is rather useful for getting up and running (not so much for writing code though). But you need to really understand the platform before adopting it. Please, please don't just adopt it because it's popular.
These are great innovations, but can we please take a moment to realize that 99% of companies are still stuck with a blend of Excel, Access and SQL Server? Why is adoption of this new tech so poor? Maybe it has something to do with the amount of confusion that all the sales pitches about data lakes, big data, AI and co. are generating.
Because it's unnecessary for those 99% of companies. If company data fits in an Excel spreadsheet, an Access database, or a single MySQL/Postgres instance, then introducing all this new tech, with all of the associated costs and little gain in return, is a net loss.
Some of this new tech, like S3/Spectrum, is simpler to run than MySQL/Postgres. Excel is OK for last-mile analytics, but it doesn't work if you are looking for serious insights. Excel also generates a significant amount of unnecessary manual work and data quality issues. It's not so black and white; what I am saying is that the direction of the analytics tech industry doesn't seem to address the real issues.
Sure. It's possible to create something that is both good for small data and has a better feature set than Excel/Access/etc., but then you don't have the other main asset of those tools: popularity. It's relatively easy to find people skilled in these technologies, whereas new technologies need to come either with advocates capable of teaching others or with programmers already skilled in them. Otherwise, teaching people new technical skills is costly.
Because most data isn't "big" data. If it fits in Excel on a laptop, why bother to roll out a distributed data lake system like Spark with all the associated ops work.
... and if it doesn't fit in Excel on a laptop, the natural path is just some database running on a desktop, then a server, then a beefier server. It's actually amazing how far something like Postgres will scale for your data needs.
The ones I have been talking to stick to their Excel because their little part of the puzzle can be solved by Excel. These technologies usually demand that data be collected, but the incentive to do so is low in any classic balkanized F500 organization.
1. It is not necessary. For smaller datasets, Excel or SQL Server is sufficient.
2. The small dataset has grown over the years and another solution would be beneficial.
But people like the things they are familiar with. They do not want the additional work of learning new tools. Whoever wants to change the existing structures will usually face high resistance.
In the second case, companies often accept the issues of the old solution and tell themselves it is the right way.
I actually think the effect of all of this hype will move people past the Excel/Access era. There are simple analyses every company could do with R or scikit-learn that would save or make them money, and they just don't know how. Someone with AI expertise is over-qualified to do this, but they are at least qualified.
Because 99% of new tech is bullshit. It changed too much, no one can keep up. These old stacks are still around because they're proven to get the job done.
data lake, delta lake, lake house. snowflake, snowpark. cloud, data warehouse. I just want to take a second to thank these companies for trying to turn my profession (data science + analytics) into the living hell of your every day enterprise and tech culture cluster-fuck (pun not intended).
I'm still coming to terms with the fact that there's actually a technology called Kafka...
/sorry, I'm just particularly bitter after having to do some Databricks training yesterday... it was basically 80% rote repetition of the sales literature about why they're so great/enterprisey, and parroting company- and platform-specific jargon. I don't like seeing my profession turn into an obsession with tech and platforms when 99.9% of companies and people can't reason properly or operate their current tools/resources efficiently. Obviously just my own opinion.
I read a wonderful comment in another thread the other day. It was about the obsession of devs to collect a whole range of technologies on their resume. It went something like: "When I hire a carpenter I won't hire him for what he has in his toolbox. I want to know what he can do with only a hammer and chisel (= Linux machine). The rest he can learn."
I will try to look up the link and obviously 'he' can be 'she' as well.
Databricks is not an RDBMS...it's a Unified Analytics Platform (their words, I know), but that really does describe it well. It's been really liberating having easy access to versioned data at your fingertips when iterating on models...and also being able to track models and share work. It does have some warts, but it's much better than any other single platform I've used for data science/ML.
Hmm, I agree there's a market here, but I don't know why I wouldn't just use Snowflake or Bigquery if what I really wanted was a big-data RDBMS.
Everyone I know who uses Databricks (and they all like it) use it as hosted Spark with S3 integrations, or write... directly to Snowflake. I'm a little skeptical they're going to get traction as a true data lake model
Snowflake and Bigquery will bite you in the ass later on. You can do 80% of the things you will need - which is great for some newbie stuff or for sales presentations. Once you need something complicated, you're on your own, while being stuck in a proprietary environment that you cannot extend.
You will have to develop some kind of data lake to store unstructured data anyway. You will end up with a Snowflake data warehouse and a data lake. Why not just go with the data lake first, then?
Databricks/Spark are just good platforms to help you do something with structured data in your lake. With the recent additions to its execution engine and Delta (strange naming tbh) it will be pretty much the same as Snowflake for you.
BigQuery/Snowflake can process Parquet and multiple other formats in Object Storage. You can use them more "freely" if you keep your raw data in open formats.
You need something more complicated than what can be done using BigQuery/Snowflake (that remaining 20%, though I would say 10%)? Export the dataset to CSV/Parquet/Avro/ORC/whatever and process it with anything, including Dataproc/HDInsight/EMR or even Databricks. That's actually a common pattern.
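(A hedged sketch of that export-then-process-elsewhere pattern with the google-cloud-bigquery client; project, dataset and bucket names are made up:)

    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.ExtractJobConfig(
        destination_format=bigquery.DestinationFormat.PARQUET
    )

    # Dump the table to GCS as Parquet; Dataproc/EMR/Databricks can pick it up there.
    extract_job = client.extract_table(
        "my-project.analytics.events",
        "gs://my-export-bucket/events/*.parquet",
        job_config=job_config,
    )
    extract_job.result()  # wait for the export to finish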
While both BQ and Snowflake are adopting lake features, they still only support parquet and other file formats for loading, not for querying.
Can't do a simple
select * from s3://file.parquet
which you can do in Spark. Having to load it into the data warehouse means you store your data twice, and it is stupidly annoying.
Many times the data doesn't even resemble anything tabular before I structure it in Python scripts. Why would I load it into a warehouse only to then pull it down, do some Python processing, and load it back? That makes the data travel from DWH storage to a data lake, then to my compute cluster, and then the same cumbersome road trip back. Pretty wasteful. Spark at least allows me to schedule a Python function across a cluster while copying only from the lake to my compute node and back.
Data warehouses like BQ and Snowflake are great for data scientists after a bunch of engineers slice and dice raw data into clean tables. For anyone working with not yet structured data, data lake wins hands down.
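(A hedged sketch of both points, with invented bucket and column names: querying a Parquet file in S3 directly from Spark, and fanning an arbitrary Python function out over the cluster:)

    # `spark` is an existing SparkSession.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    # (1) the "select * from s3://file.parquet" equivalents in Spark:
    df = spark.read.parquet("s3a://my-lake/raw/file.parquet")
    spark.sql("SELECT * FROM parquet.`s3a://my-lake/raw/file.parquet` LIMIT 10").show()

    # (2) arbitrary Python run in parallel on the workers, lake-to-lake:
    def messy_to_clean(blob):
        # placeholder for the real restructuring logic
        return blob.strip().lower() if blob else None

    clean = F.udf(messy_to_clean, StringType())
    df.withColumn("clean_payload", clean("raw_payload")) \
      .write.mode("overwrite").parquet("s3a://my-lake/clean/file/")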
Not quite -- Snowflake allows you to query data directly from S3 without having to ingest it. You can either query the S3 file directly [1], or define an external table on top of it [2]. If the data is JSON, Avro, XML, or some other "semi-structured" format, then Snowflake's "variant" data type makes this type of data very easy to work with. For raw text or log lines, you can still load to Snowflake and process using regex or UDFs. For unstructured or "complex" data like images and PDFs, then you may need to reach for a different compute environment.
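(A hypothetical sketch of that VARIANT workflow via snowflake-connector-python; stage, table, column and credential values are all placeholders:)

    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="...",
        warehouse="ANALYTICS_WH", database="RAW", schema="EVENTS",
    )
    cur = conn.cursor()

    # Land semi-structured JSON into a single VARIANT column...
    cur.execute("CREATE TABLE IF NOT EXISTS raw_events (v VARIANT)")
    cur.execute("COPY INTO raw_events FROM @landing_stage/events/ FILE_FORMAT = (TYPE = 'JSON')")

    # ...then query nested fields directly with path syntax and casts.
    cur.execute("SELECT v:event_type::string, v:payload.user_id::string FROM raw_events LIMIT 10")
    for row in cur:
        print(row)
    conn.close()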
Yeah, I thought Snowflake external tables would do this, but it is not the case.
An external table in Snowflake only allows you to ingest data from S3 into their storage (which is also S3 behind the scenes).
Perhaps something has changed since the last time I tried, but my conclusion then was that "external tables" in Snowflake are not what you think they are.
Also, I have not seen examples of "select * from s3://file.json" in the links you provided.
> Can't do a simple select * from s3://file.parquet
That's not really true; Snowflake can query directly from S3 just fine, even from other clouds. You just need to set up the credentials (or supply them in the query, but that's not usually a good idea).
> While both BQ and Snowflake are adopting lake features, they still only support parquet and other file formats for loading, not for querying.
This is not true. Snowflake allows you to create an external stage (pointer to s3) and then query any prefix you want as long as you provide the correct file format arguments or types.
select $1, $2 from @somestage/someprefix (file_format => ....)
edit: hadn't refreshed the page and i can see others have already responded to this point.
In BQ you can query data as an external table with Hive partitioning. No need for duplication, and it's extremely useful for consuming "landing zones" in a data lake.
Pretty similar to defining a Hive table and then using any other engine to process it.
PS: BigQuery Omni (now beta) will support object storage solutions from other cloud providers.
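(A hedged sketch of that Hive-partitioned external table setup with the google-cloud-bigquery client; bucket, dataset and table names are invented, as is the dt partition column:)

    from google.cloud import bigquery

    client = bigquery.Client()

    external_config = bigquery.ExternalConfig("PARQUET")
    external_config.source_uris = ["gs://my-landing-zone/events/*"]

    hive_opts = bigquery.external_config.HivePartitioningOptions()
    hive_opts.mode = "AUTO"  # infer partition keys like dt=2021-01-01 from the paths
    hive_opts.source_uri_prefix = "gs://my-landing-zone/events/"
    external_config.hive_partitioning = hive_opts

    table = bigquery.Table("my-project.lake.events_external")
    table.external_data_configuration = external_config
    client.create_table(table, exists_ok=True)

    # Queries hit the files in GCS directly; partition filters prune what is read.
    job = client.query(
        "SELECT COUNT(*) FROM `my-project.lake.events_external` WHERE dt = '2021-01-01'"
    )
    print(list(job))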
>Once you need something complicated, you're on your own, while being stuck in a proprietary environment that you cannot extend.
Snowflake has a Spark connector too, so I don't know what the difference would be writing a Spark job against Delta Lake vs. Snowflake.
> 80% of the things you will need - which is great for some newbie stuff or for sales presentations.
This is obviously wrong.
> You will have to develop some kind of data lake to store unstructured data anyway. You will end up with a Snowflake data warehouse and a data lake. Why not just go with data lake first then.
We store unstructured data in snowflake. I don't understand why you need a datalake on top of it.
EXACTLY. You absolutely can store unstructured and semi-structured data in Snowflake. I find it baffling, and at this point a bit irritating, that there is this community of people insisting that it is not allowed for... some unspecified reason.
Why would I store e.g. a bunch of HTML files as string columns in Snowflake, only to pull them down, process them in Python and load them back into some other string table?
Because it actually costs the same, and if you process them in Snowflake using SQL or UDFs, you will get your results in seconds and you won't have to manage any of the underlying infrastructure.
>Once you need something complicated, you're on your own, while being stuck in a proprietary environment that you cannot extend.
Snowflake supports enough SQL constructs to allow for very complicated queries. If that doesn't suit your needs then there's stored procedures and custom javascript UDFs you can write. That covers probably 99+% of the use cases at most companies and usually the rest can be done somewhere else on pre-aggregated data.
1. Why tf would JavaScript be picked as the UDF language when Python dominates the data world?
2. People usually load complex Python libraries for data processing. I wonder if Snowflake UDFs would support that or just allow you to use the standard library.
1. why not? it's not like you're getting external libraries either way.
2. if you want complex libraries, you can use external functions (aws lambdas, etc) in Snowflake.
It's a lot simpler to use a single system as both your data lake, and your data warehouse. As Databricks gets better and better at the core data warehouse features, it becomes feasible to use it for both. Meanwhile, Snowflake and BQ are coming from the other direction, implementing data lake features. AWS strategy seems to be, just make it easier to have 2 systems and move data back and forth.
The reason Snowflake has such a high market cap is that its customers aren't paying much now, but they'll be paying and paying monthly forever. It's lock-in on a massive scale.
Delta lake is something you can run on data you feel you still have some semblance of control over.
I work for Snowflake and I'm always curious about this way of thinking. How is putting the data in Snowflake any different to any other RDBMS, or even S3, in this regard?
You don't own the underlying storage, true, but there's a defined method of getting data in and a defined method of getting that data out. The API is different to S3, sure, but it's an API all the same.
FWIW the only organisation I know that's ever truly locked in data was SAP, through their byzantine licensing and horrific API. Snowflake is no more locked in than anything else with a standard SQL interface.
Their Delta also has SSD caching, which turns out to be logic that stores a local copy of the file you queried, for faster re-querying. Going to start calling my LRU cache function that as well...
My company loves them. I think they only do a few things well, of which marketing is the best, and they are not worth the money for data science teams with devops skills. Happy to hear from others if I am wrong.
Depends. If you only have a couple workspaces, then no. It's worth the money for a company with lots of data science teams, and one team who is janitoring all the workspace infra. Our company has 20 workspaces for 10 teams already and it would be a nightmare if we expected everyone to fix and manage their own AWS stuff.
I’m one of the apparently odd few who doesn’t like it. It’s slow, overkill for many tasks, and a Swiss Army knife that’s ok at everything but not very good at anything. Debugging it can be a very frustrating experience. They’re trying to turn it into an everything platform so it checks every box in the procurement process and don’t care about actually making it good.
I have a feeling legacy databricks installations will be much derided in 5-10 years time
Were there some serious mental gymnastics coming from Fivetran yesterday, or am I missing something? The CEO had a data lake hit piece in Forbes ('The Business Case For Ditching Your Data Lake'[1]) and almost simultaneously authored this 'Databricks' Delta Lake is not a data lake' blog.
Felt like the CEO wanted to make a strong point, but has a relationship with Databricks that he didn't want to undermine, so did some reclassification gymnastics to make it all fit. But I am probably too cynical.
Finally, I beg for mercy from the 'lakehouse'. It is a bridge too far ;-)
The key gap as I see it is that Databricks doesn't support multi-table transactions. And that is why you can't treat it as though it were a drop-in functional replacement for an RDBMS.
(Why no vector clocks in the manifest files or something?)
No, Databricks is a distribution of Apache Spark with some value-added features such as the Delta Lake data format and a clean UI for hosting notebooks, doing cluster admin, etc