Snowflake and BigQuery are quite expensive for big datasets.
Databricks Delta Lake has its use cases, but some aspects are rough around the edges: vacuuming is very slow, the design decision to store partitioned data on disk in folders has certain pros and cons, etc.
There are a lot of great products in the data lake space, but lots more innovation is needed going forward.
> Snowflake and BigQuery are quite expensive for big datasets.
This is just flatly untrue: they charge nearly the same cost per TB as object storage, and they store everything in a compressed, columnar format, so they're about as storage-efficient as you can get.
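A quick back-of-envelope calculation illustrates the claim. The per-TB figures below are illustrative assumptions in the ballpark of published list prices, not official quotes; check each vendor's pricing page before relying on them:

```python
# Illustrative storage prices in USD per TB per month (assumed values,
# not official quotes; actual prices vary by region and contract).
S3_STANDARD_PER_TB = 23.0         # ~ $0.023/GB/month
SNOWFLAKE_CAPACITY_PER_TB = 23.0  # pre-purchased capacity storage
BIGQUERY_ACTIVE_PER_TB = 20.0     # ~ $0.02/GB/month active storage

def monthly_storage_cost(tb: float, price_per_tb: float) -> float:
    """Storage cost in USD for `tb` terabytes at a flat per-TB monthly rate."""
    return tb * price_per_tb

# For 100 TB, the warehouses' storage bill lands in the same ballpark as
# raw S3, and columnar compression often shrinks the stored bytes further.
for name, price in [("S3", S3_STANDARD_PER_TB),
                    ("Snowflake", SNOWFLAKE_CAPACITY_PER_TB),
                    ("BigQuery", BIGQUERY_ACTIVE_PER_TB)]:
    print(f"{name}: ${monthly_storage_cost(100, price):,.0f}/month for 100 TB")
```

Under these assumptions the storage-cost gap between the warehouses and plain object storage is small; the real differences show up on the compute side.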
I have heard many people make the same claim, and I can't figure it out. Is there something wrong with my calculator?
For sure, but you're not going to fix that by building your own data lake out of, say, Parquet-on-S3. You'll still pay the cost of compute when you analyze that data, and a well-optimized commercial database system is extremely hard to beat. Even Presto, setting aside the people cost of managing it yourself, can't beat the commercial systems: https://fivetran.com/blog/warehouse-benchmark
That's because Snowflake happens to charge relatively little for backing storage at the moment; as I recall, it's about the same as object storage. Virtual warehouses, on the other hand, are quite expensive, especially if they use a lot of compute.
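The storage-vs-compute split can be made concrete with another rough sketch. The credit price and credits-per-hour figures below are assumptions modeled on Snowflake's published scheme (warehouses burn credits per hour, doubling with each size step); actual rates depend on edition and region:

```python
# Illustrative Snowflake-style pricing (assumed values; check your contract).
CREDIT_PRICE_USD = 3.0        # USD per credit, varies by edition/region
CREDITS_PER_HOUR = {          # credits burned per hour by warehouse size
    "X-Small": 1, "Small": 2, "Medium": 4, "Large": 8, "X-Large": 16,
}
STORAGE_PER_TB = 23.0         # USD per TB per month, assumed

def monthly_compute_cost(size: str, hours_per_day: float, days: int = 30) -> float:
    """Compute cost in USD for a warehouse of `size` running `hours_per_day`."""
    return CREDITS_PER_HOUR[size] * hours_per_day * days * CREDIT_PRICE_USD

# A Large warehouse running 8 hours a day: 8 credits * 8 h * 30 d * $3
compute = monthly_compute_cost("Large", 8)
storage = 100 * STORAGE_PER_TB  # storing 100 TB
print(f"compute ${compute:,.0f}/month vs storage ${storage:,.0f}/month")
```

Even at these modest assumed rates, a part-time mid-size warehouse costs more per month than storing 100 TB, which is why the compute side dominates the bill.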