Hmm, I agree there's a market here, but I don't know why I wouldn't just use Snowflake or BigQuery if what I really wanted was a big-data RDBMS.
Everyone I know who uses Databricks (and they all like it) uses it as hosted Spark with S3 integrations, or writes... directly to Snowflake. I'm a little skeptical they're going to get traction as a true data lake model.
Snowflake and BigQuery will bite you in the ass later on. You can do 80% of the things you will need - which is great for some newbie stuff or for sales presentations. Once you need something complicated, you're on your own, while being stuck in a proprietary environment that you cannot extend.
You will have to develop some kind of data lake to store unstructured data anyway. You will end up with a Snowflake data warehouse and a data lake. Why not just go with a data lake first, then?
Databricks/Spark are just good platforms to help you do something with structured data in your lake. With the recent additions to its execution engine and Delta (strange naming tbh) it will be pretty much the same as Snowflake for you.
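For instance, registering a Delta table that lives directly in object storage is just Spark SQL. A minimal sketch, assuming the Delta Lake library is on the cluster; the bucket, path, and columns are made up:

-- Spark SQL + Delta Lake: a table that lives entirely in the lake (hypothetical names)
create table if not exists events (user_id string, event_date date)
using delta
location 's3a://my-bucket/delta/events';

-- query it like any warehouse table
select event_date, count(*) from events group by event_date;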
BigQuery/Snowflake can process Parquet and multiple other formats in Object Storage. You can use them more "freely" if you keep your raw data in open formats.
You need something more complicated than what can be done using BigQuery/Snowflake (that remaining 20%, though I would say 10%)? Export the dataset to CSV/Parquet/Avro/ORC/whatever and process it with anything, including Dataproc/HDInsight/EMR or even Databricks. That's actually a common pattern.
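Roughly, and with made-up stage and table names, that export step in Snowflake is a single COPY INTO an external stage; from there Spark, EMR, or Dataproc can read the Parquet files directly:

-- Snowflake sketch: unload a table to Parquet files on an external S3 stage (names hypothetical)
copy into @my_s3_stage/exports/orders/
from analytics.orders
file_format = (type = parquet);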
While both BQ and Snowflake are adopting lake features, they still only support parquet and other file formats for loading, not for querying.
Can't do a simple
select * from s3://file.parquet
which you can do in Spark. Having to load it into the data warehouse means you end up storing the data twice, and it is stupidly annoying.
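For comparison, the Spark SQL version of querying the file in place looks roughly like this (bucket and path are made up):

-- Spark SQL: query a Parquet file where it sits, no load step
select * from parquet.`s3a://my-bucket/raw/file.parquet`;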
Many times the data doesn't even resemble anything tabular before I structure it in Python scripts. Why would I load it into a warehouse only to pull it down again for some Python processing and then load it back? That makes the data travel from DWH storage to a data lake, then to my compute cluster, and then the same cumbersome road trip back. Pretty wasteful. Spark at least lets me schedule a Python function across a cluster while copying only from the lake to my compute nodes and back.
Data warehouses like BQ and Snowflake are great for data scientists after a bunch of engineers slice and dice raw data into clean tables. For anyone working with not-yet-structured data, the data lake wins hands down.
Not quite -- Snowflake allows you to query data directly from S3 without having to ingest it. You can either query the S3 file directly [1], or define an external table on top of it [2]. If the data is JSON, Avro, XML, or some other "semi-structured" format, then Snowflake's "variant" data type makes this type of data very easy to work with. For raw text or log lines, you can still load to Snowflake and process using regex or UDFs. For unstructured or "complex" data like images and PDFs, you may need to reach for a different compute environment.
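A rough sketch of both options, with made-up stage/table names and the credentials/storage integration omitted:

-- [1] query staged S3 files in place
create stage raw_stage url = 's3://my-bucket/raw/' file_format = (type = json);
select $1:userId::string, $1:ts::timestamp from @raw_stage/events.json;

-- [2] define an external table over the same prefix and use its variant column
create external table events_ext
with location = @raw_stage
file_format = (type = json)
auto_refresh = false;
select value:userId::string as user_id from events_ext;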
Yeah, I thought Snowflake external tables would do this, but that's not the case.
An external table in Snowflake only lets you ingest data from S3 into their storage (which is also S3 behind the scenes).
Perhaps something has changed since the last time I tried, but when I tried my conclusion was that "external tables" in Snowflake are not what you think they are.
Also I have not seen examples of "select * from s3://file.json" in the links you provided.
> Can't do a simple select * from s3://file.parquet
That's not really true; Snowflake can query directly from S3 just fine, even from other clouds. You just need to set up the credentials (or supply them in the query, but that's not usually a good idea).
> While both BQ and Snowflake are adopting lake features, they still only support parquet and other file formats for loading, not for querying.
This is not true. Snowflake allows you to create an external stage (a pointer to S3) and then query any prefix you want, as long as you provide the correct file format arguments or types.
select $1 from @somestage/someprefix (file_format => ...)
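For anyone trying this, the query above assumes a stage and a named file format already exist; a sketch of that setup, with hypothetical names and role ARN:

create storage integration s3_int
type = external_stage
storage_provider = 'S3'
enabled = true
storage_aws_role_arn = 'arn:aws:iam::123456789012:role/snowflake-read'
storage_allowed_locations = ('s3://my-bucket/landing/');

create stage somestage
url = 's3://my-bucket/landing/'
storage_integration = s3_int;

create file format my_parquet_format type = parquet;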
Edit: hadn't refreshed the page, and I can see others have already responded to this point.
In BQ you can query data as an external table with Hive partitioning. No need for duplication, and it's extremely useful for consuming "landing zones" in a data lake.
Pretty similar to defining a Hive table and then using any other engine to process it.
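A sketch of the DDL, with made-up dataset, bucket, and prefix names:

-- BigQuery: external table over Hive-partitioned Parquet files in GCS
create external table my_dataset.landing_events
with partition columns
options (
  format = 'PARQUET',
  uris = ['gs://my-bucket/landing/events/*'],
  hive_partition_uri_prefix = 'gs://my-bucket/landing/events'
);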
PS: BigQuery Omni (now beta) will support object storage solutions from other cloud providers.
>Once you need something complicated, you're on your own, while being stuck in a proprietary environment that you cannot extend.
Snowflake has a Spark connector too, so I don't know what the difference would be between writing a Spark job against Delta Lake vs. Snowflake.
> 80% of the things you will need - which is great for some newbie stuff or for sales presentations.
This is obviously wrong.
> You will have to develop some kind of data lake to store unstructured data anyway. You will end up with a Snowflake data warehouse and a data lake. Why not just go with a data lake first, then?
We store unstructured data in Snowflake. I don't understand why you need a data lake on top of it.
EXACTLY. You absolutely can store unstructured and semi-structured data in Snowflake. I find it baffling, and at this point a bit irritating, that there is this community of people insisting that it is not allowed for... some unspecified reason.
Why would I store e.g. a bunch of HTML files as string columns in Snowflake, only to download them, process them in Python, and load them back into some other string table?
Because it actually costs the same, and if you process them in Snowflake using SQL or UDFs, you will get your results in seconds and you won't have to manage any of the underlying infrastructure.
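To make that concrete for the HTML example above, here is a minimal sketch of a JavaScript UDF; the table, column, and function names are made up, and the regex is deliberately naive:

create or replace function extract_title(html string)
returns string
language javascript
as
$$
// arguments are exposed in uppercase inside Snowflake JavaScript UDFs
var m = HTML.match(/<title[^>]*>([^<]*)<\/title>/i);
return m ? m[1] : null;
$$;

select url, extract_title(raw_html) as title from raw_pages;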
>Once you need something complicated, you're on your own, while being stuck in a proprietary environment that you cannot extend.
Snowflake supports enough SQL constructs to allow for very complicated queries. If that doesn't suit your needs, then there are stored procedures and custom JavaScript UDFs you can write. That covers probably 99+% of the use cases at most companies, and usually the rest can be done somewhere else on pre-aggregated data.
1. Why tf would JavaScript be picked as the UDF language when Python dominates the data world?
2. People usually load complex Python libraries for data processing. I wonder if Snowflake UDFs would support that or just allow you to use the standard library.
1. Why not? It's not like you're getting external libraries either way.
2. If you want complex libraries, you can use external functions (AWS Lambdas, etc.) in Snowflake.
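A sketch of what wiring that up looks like; the integration name, role ARN, and endpoint are all made up:

create or replace api integration lambda_int
api_provider = aws_api_gateway
api_aws_role_arn = 'arn:aws:iam::123456789012:role/snowflake-extfunc'
api_allowed_prefixes = ('https://abc123.execute-api.us-east-1.amazonaws.com/prod/')
enabled = true;

create or replace external function parse_document(doc string)
returns variant
api_integration = lambda_int
as 'https://abc123.execute-api.us-east-1.amazonaws.com/prod/parse';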
It's a lot simpler to use a single system as both your data lake and your data warehouse. As Databricks gets better and better at the core data warehouse features, it becomes feasible to use it for both. Meanwhile, Snowflake and BQ are coming from the other direction, implementing data lake features. AWS's strategy seems to be to just make it easier to have two systems and move data back and forth.
The reason Snowflake has such a high market cap is that its customers aren't paying much now, but they'll be paying and paying monthly forever. It's lock-in on a massive scale.
Delta Lake is something you can run on data you feel you still have some semblance of control over.
I work for Snowflake and I'm always curious about this way of thinking. How is putting the data in Snowflake any different to any other RDBMS, or even S3, in this regard?
You don't own the underlying storage, true, but there's a defined method of getting data in and a defined method of getting that data out. The API is different to S3, sure, but it's an API all the same.
FWIW the only organisation I know that's ever truly locked in data was SAP, through their byzantine licensing and horrific API. Snowflake is no more locked in than anything else with a standard SQL interface.
Their Delta also has SSD caching, which turns out to be logic that stores a local copy of the file you queried for faster re-querying. Going to start calling my LRU cache function that as well...
My company loves them. I think they only do a few things well, marketing being the best of those, and they are not worth the money for data science teams with DevOps skills. Happy to hear from others if I am wrong.
Depends. If you only have a couple of workspaces, then no. It's worth the money for a company with lots of data science teams and one team that's janitoring all the workspace infra. Our company has 20 workspaces for 10 teams already, and it would be a nightmare if we expected everyone to fix and manage their own AWS stuff.
I’m one of the apparently odd few who doesn't like it. It's slow, overkill for many tasks, and a Swiss Army knife that's OK at everything but not very good at anything. Debugging it can be a very frustrating experience. They’re trying to turn it into an everything platform so it checks every box in the procurement process, and they don't seem to care about actually making it good.
I have a feeling legacy Databricks installations will be much derided in 5-10 years' time.