Hmm, I agree there's a market here, but I don't know why I wouldn't just use Snowflake or BigQuery if what I really wanted was a big-data RDBMS.
Everyone I know who uses Databricks (and they all like it) uses it as hosted Spark with S3 integrations, or writes... directly to Snowflake. I'm a little skeptical they're going to get traction as a true data lake model.
Snowflake and BigQuery will bite you in the ass later on. You can do 80% of the things you will need - which is great for some newbie stuff or for sales presentations. Once you need something complicated, you're on your own, while being stuck in a proprietary environment that you cannot extend.
You will have to develop some kind of data lake to store unstructured data anyway. You will end up with a Snowflake data warehouse and a data lake. Why not just go with a data lake first, then?
Databricks/Spark are just good platforms to help you do something with structured data in your lake. With the recent additions to its execution engine and Delta (strange naming tbh) it will be pretty much the same as Snowflake for you.
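For instance, registering a Delta table that lives directly in object storage is just Spark SQL. A minimal sketch, assuming the Delta Lake library is on the cluster; the bucket, path, and columns are made up:

-- Spark SQL + Delta Lake: a table that lives entirely in the lake (hypothetical names)
create table if not exists events (user_id string, event_date date)
using delta
location 's3a://my-bucket/delta/events';

-- query it like any warehouse table
select event_date, count(*) from events group by event_date;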
BigQuery/Snowflake can process Parquet and multiple other formats in Object Storage. You can use them more "freely" if you keep your raw data in open formats.
You need something more complicated than what can be done using BigQuery/Snowflake (that remaining 20%, though I would say 10%)? Export the dataset to CSV/Parquet/Avro/ORC/whatever and process it with anything, including Dataproc/HDInsight/EMR or even Databricks. That's actually a common pattern.
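Roughly, and with made-up stage and table names, that export step in Snowflake is a single COPY INTO an external stage; from there Spark, EMR, or Dataproc can read the Parquet files directly:

-- Snowflake sketch: unload a table to Parquet files on an external S3 stage (names hypothetical)
copy into @my_s3_stage/exports/orders/
from analytics.orders
file_format = (type = parquet);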
While both BQ and Snowflake are adopting lake features, they still only support parquet and other file formats for loading, not for querying.
Can't do a simple
select * from s3://file.parquet
which you can do in Spark. Having to load it into the data warehouse means you end up storing the data twice, and it is stupidly annoying.
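For comparison, the Spark SQL version of querying the file in place looks roughly like this (bucket and path are made up):

-- Spark SQL: query a Parquet file where it sits, no load step
select * from parquet.`s3a://my-bucket/raw/file.parquet`;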
Many times the data doesn't even resemble anything tabular before I structure it in Python scripts. Why would I load it into a warehouse only to pull it down again for some Python processing and then load it back? That makes the data travel from DWH storage to a data lake, then to my compute cluster, and then the same cumbersome road trip back. Pretty wasteful. Spark at least lets me schedule a Python function across a cluster while copying only from the lake to my compute nodes and back.
Data warehouses like BQ and Snowflake are great for data scientists after a bunch of engineers slice and dice raw data into clean tables. For anyone working with not-yet-structured data, the data lake wins hands down.
Not quite -- Snowflake allows you to query data directly from S3 without having to ingest it. You can either query the S3 file directly [1], or define an external table on top of it [2]. If the data is JSON, Avro, XML, or some other "semi-structured" format, then Snowflake's "variant" data type makes this type of data very easy to work with. For raw text or log lines, you can still load to Snowflake and process using regex or UDFs. For unstructured or "complex" data like images and PDFs, you may need to reach for a different compute environment.
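A rough sketch of both options, with made-up stage/table names and the credentials/storage integration omitted:

-- [1] query staged S3 files in place
create stage raw_stage url = 's3://my-bucket/raw/' file_format = (type = json);
select $1:userId::string, $1:ts::timestamp from @raw_stage/events.json;

-- [2] define an external table over the same prefix and use its variant column
create external table events_ext
with location = @raw_stage
file_format = (type = json)
auto_refresh = false;
select value:userId::string as user_id from events_ext;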
Yeah, I thought Snowflake external tables would do this, but that's not the case.
An external table in Snowflake only lets you ingest data from S3 into their storage (which is also S3 behind the scenes).
Perhaps something has changed since the last time I tried, but when I tried my conclusion was that "external tables" in Snowflake are not what you think they are.
Also I have not seen examples of "select * from s3://file.json" in the links you provided.
> Can't do a simple select * from s3://file.parquet
That's not really true; Snowflake can query directly from S3 just fine, even from other clouds. You just need to set up the credentials (or supply them in the query, but that's not usually a good idea).
> While both BQ and Snowflake are adopting lake features, they still only support parquet and other file formats for loading, not for querying.
This is not true. Snowflake allows you to create an external stage (a pointer to S3) and then query any prefix you want, as long as you provide the correct file format arguments or types.
select $1 from @somestage/someprefix (file_format => ...)
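For anyone trying this, the query above assumes a stage and a named file format already exist; a sketch of that setup, with hypothetical names and role ARN:

create storage integration s3_int
type = external_stage
storage_provider = 'S3'
enabled = true
storage_aws_role_arn = 'arn:aws:iam::123456789012:role/snowflake-read'
storage_allowed_locations = ('s3://my-bucket/landing/');

create stage somestage
url = 's3://my-bucket/landing/'
storage_integration = s3_int;

create file format my_parquet_format type = parquet;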
Edit: hadn't refreshed the page, and I can see others have already responded to this point.
In BQ you can query data as an external table with Hive partitioning. No need for duplication, and it's extremely useful for consuming "landing zones" in a data lake.
Pretty similar to defining a Hive table and then using any other engine to process it.
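A sketch of the DDL, with made-up dataset, bucket, and prefix names:

-- BigQuery: external table over Hive-partitioned Parquet files in GCS
create external table my_dataset.landing_events
with partition columns
options (
  format = 'PARQUET',
  uris = ['gs://my-bucket/landing/events/*'],
  hive_partition_uri_prefix = 'gs://my-bucket/landing/events'
);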
PS: BigQuery Omni (now beta) will support object storage solutions from other cloud providers.
>Once you need something complicated, you're on your own, while being stuck in a proprietary environment that you cannot extend.
Snowflake has a Spark connector too, so I don't know what the difference would be between writing a Spark job against Delta Lake vs. Snowflake.
> 80% of the things you will need - which is great for some newbie stuff or for sales presentations.
This is obviously wrong.
> You will have to develop some kind of data lake to store unstructured data anyway. You will end up with a Snowflake data warehouse and a data lake. Why not just go with a data lake first, then?
We store unstructured data in Snowflake. I don't understand why you need a data lake on top of it.
EXACTLY. You absolutely can store unstructured and semi-structured data in Snowflake. I find it baffling, and at this point a bit irritating, that there is this community of people insisting that it is not allowed for... some unspecified reason.
Why would I store e.g. a bunch of HTML files as string columns in Snowflake, only to download them, process them in Python, and load them back into some other string table?
Because it actually costs the same, and if you process them in Snowflake using SQL or UDFs, you will get your results in seconds and you won't have to manage any of the underlying infrastructure.
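To make that concrete for the HTML example above, here is a minimal sketch of a JavaScript UDF; the table, column, and function names are made up, and the regex is deliberately naive:

create or replace function extract_title(html string)
returns string
language javascript
as
$$
// arguments are exposed in uppercase inside Snowflake JavaScript UDFs
var m = HTML.match(/<title[^>]*>([^<]*)<\/title>/i);
return m ? m[1] : null;
$$;

select url, extract_title(raw_html) as title from raw_pages;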
>Once you need something complicated, you're on your own, while being stuck in a proprietary environment that you cannot extend.
Snowflake supports enough SQL constructs to allow for very complicated queries. If that doesn't suit your needs, then there are stored procedures and custom JavaScript UDFs you can write. That covers probably 99+% of the use cases at most companies, and usually the rest can be done somewhere else on pre-aggregated data.
1. Why tf would JavaScript be picked as the UDF language when Python dominates the data world?
2. People usually load complex Python libraries for data processing. I wonder if Snowflake UDFs would support that or just allow you to use the standard library.
1. Why not? It's not like you're getting external libraries either way.
2. If you want complex libraries, you can use external functions (AWS Lambdas, etc.) in Snowflake.
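A sketch of what wiring that up looks like; the integration name, role ARN, and endpoint are all made up:

create or replace api integration lambda_int
api_provider = aws_api_gateway
api_aws_role_arn = 'arn:aws:iam::123456789012:role/snowflake-extfunc'
api_allowed_prefixes = ('https://abc123.execute-api.us-east-1.amazonaws.com/prod/')
enabled = true;

create or replace external function parse_document(doc string)
returns variant
api_integration = lambda_int
as 'https://abc123.execute-api.us-east-1.amazonaws.com/prod/parse';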
It's a lot simpler to use a single system as both your data lake and your data warehouse. As Databricks gets better and better at the core data warehouse features, it becomes feasible to use it for both. Meanwhile, Snowflake and BQ are coming from the other direction, implementing data lake features. AWS's strategy seems to be to just make it easier to have two systems and move data back and forth.
The reason Snowflake has such a high market cap is that its customers aren't paying much now, but they'll be paying and paying monthly forever. It's lock-in on a massive scale.
Delta Lake is something you can run on data you feel you still have some semblance of control over.
I work for Snowflake and I'm always curious about this way of thinking. How is putting the data in Snowflake any different to any other RDBMS, or even S3, in this regard?
You don't own the underlying storage, true, but there's a defined method of getting data in and a defined method of getting that data out. The API is different to S3, sure, but it's an API all the same.
FWIW the only organisation I know that's ever truly locked in data was SAP, through their byzantine licensing and horrific API. Snowflake is no more locked in than anything else with a standard SQL interface.
Their Delta also has SSD caching, which turns out to be logic that stores a local copy of the file you queried for faster re-querying. Going to start calling my LRU cache function that as well...
My company loves them. I think they only do a few things well, marketing being the best of those, and they are not worth the money for data science teams with DevOps skills. Happy to hear from others if I am wrong.
Depends. If you only have a couple of workspaces, then no. It's worth the money for a company with lots of data science teams and one team that's janitoring all the workspace infra. Our company has 20 workspaces for 10 teams already, and it would be a nightmare if we expected everyone to fix and manage their own AWS stuff.
I’m one of the apparently odd few who doesn't like it. It's slow, overkill for many tasks, and a Swiss Army knife that's OK at everything but not very good at anything. Debugging it can be a very frustrating experience. They’re trying to turn it into an everything platform so it checks every box in the procurement process, and they don't seem to care about actually making it good.
I have a feeling legacy Databricks installations will be much derided in 5-10 years' time.