
I can't argue that dbt isn't great -- it is. It is, however, unfortunate that Python is still the dominant lingua franca here. Between all the Python data tools and the JSON being shuffled around à la singer.io, I just can't help but think there is a huge opportunity awaiting the engineers who build a company around an open source set of products using rigorous languages and schemas.

It's a goddamn madhouse.



Does it matter that dbt is written in Python? dbt models are still SQL at heart. Sure, there's the addition of Jinja to enable references for data lineage and configuration, but it's all compiled down to SQL with fine-grained control over what's produced.
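
To make that concrete, here's a minimal sketch of the compile-to-SQL idea using plain Jinja2 rather than dbt's actual compiler -- the ref function below is just a stand-in for dbt's ref() macro, which in real dbt also records lineage between models:

    # Minimal sketch of "Jinja compiled down to SQL", using plain Jinja2
    # rather than dbt's real compiler. `ref` is a stand-in for dbt's ref()
    # macro; real dbt also records a lineage edge when ref() is evaluated.
    from jinja2 import Template

    MODEL_SQL = """
    select
        customer_id,
        count(*) as order_count
    from {{ ref('stg_orders') }}
    group by customer_id
    """

    def ref(model_name):
        # dbt would resolve this to the warehouse-qualified relation;
        # here we just fake a schema prefix.
        return "analytics." + model_name

    compiled = Template(MODEL_SQL).render(ref=ref)
    print(compiled)  # plain SQL the warehouse can run as-is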

Forgive me if I come across as combative, but I don't understand generic appeals to things like a language being rigorous. Rigorous to what end? What problem is it solving where that is required? If you know something specific in this domain where that level of rigor is needed, why not share what it is?

There are a lot of problems in the analytics space (and a lot of opportunity for new practices, tools, and businesses), but I would argue that, at the end of the day, the primary issue is whether data producers choose to model data so that it is legible outside the system that produced it, far more than any particular language or methodology.


For typical datasets (~95% of companies have medium-sized data on the order of gigabytes), you are 100% correct that data modeling and formatting is the biggest challenge.

Having well-modeled data that matches the business domain is a massive (2-10x) productivity boost for most business analysis.


Well, part of the benefit is rapid development; it's mind-boggling how quickly someone can stand up a dbt project and begin to iterate on transforms. Using Python/SQL/JSON (at small/medium scales) keeps the data stack consistent and lowers the barrier to entry. There's no reason to prematurely optimize when your bottleneck is the modeling and not the actual data volume.


> using rigorous languages and schemas

And what value does it add?

The vast majority of companies are working with < 1 TB of data that sits neatly in a single cloud database. Python and tools like dbt are fantastic for a huge class of problems without compromising workflow velocity, and pushing transformations into SQL removes most Python-bound performance constraints.

Changing singer.io to require Thrift or protobuf schemas isn't going to add the value you think it is. How data is shuffled between systems is considerably less important and less time-consuming than figuring out how to put that data to work.


Singer and dbt are different - dbt orchestrates (it evaluates macros and runs SQL) while Singer actually has the data flow through it. So rewriting Singer in something fast (I guess Java, since even the most obscure database has a JDBC driver) would definitely help.
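
For context on "has the data flow through it": a Singer tap is just a process that writes JSON messages (SCHEMA / RECORD / STATE) to stdout, so every row passes through the Python process. A rough sketch of the shape, simplified relative to the actual Singer spec:

    # Rough sketch of what a Singer tap does: every record is serialized
    # as a JSON line on stdout, so all data physically flows through the
    # process. Simplified relative to the real Singer spec (no bookmarks,
    # stream versions, etc.).
    import json
    import sys

    def emit(message):
        sys.stdout.write(json.dumps(message) + "\n")

    emit({"type": "SCHEMA", "stream": "users",
          "schema": {"properties": {"id": {"type": "integer"},
                                    "email": {"type": "string"}}},
          "key_properties": ["id"]})

    for row in ({"id": 1, "email": "a@example.com"},
                {"id": 2, "email": "b@example.com"}):
        emit({"type": "RECORD", "stream": "users", "record": row})

    emit({"type": "STATE", "value": {"users": {"last_id": 2}}})

A target reads those lines from stdin and loads them, which is why serialization cost starts to matter once volumes get large.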

It would only help once you start shuttling around terabytes.


These tools were not built for multi-TB workloads, and pointing that out as a deficiency when it was clearly not a design goal is a misleading argument.


There are only languages people hate and ones no one uses.


and golang


Our own approach is to keep singer.io et al. and JSON (flexible! streaming-capable!), but deeply invest in JSON Schema to a) statically infer and map into other kinds of schema on your behalf -- TypeScript, Elasticsearch, and SQL DDL so far -- and b) optimize validation so it's fast enough to be in the high-scale critical path. If you validate every document before it's read or written by your hacky transform, you've gone a long way towards limiting the bug blast radius.
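
The idea in miniature (a minimal sketch using the off-the-shelf Python jsonschema package, not the optimized validator described above):

    # Minimal sketch of "validate every document at the boundary", using
    # the off-the-shelf jsonschema package rather than an optimized
    # validator. A buggy transform fails loudly here instead of silently
    # corrupting whatever sits downstream.
    from jsonschema import Draft7Validator

    ORDER_SCHEMA = {
        "type": "object",
        "required": ["order_id", "amount_cents"],
        "properties": {
            "order_id": {"type": "string"},
            "amount_cents": {"type": "integer", "minimum": 0},
        },
    }

    validator = Draft7Validator(ORDER_SCHEMA)

    def write_document(doc):
        errors = list(validator.iter_errors(doc))
        if errors:
            raise ValueError([e.message for e in errors])
        # ...hand off to the actual sink (warehouse, queue, file, etc.)

    write_document({"order_id": "A-123", "amount_cents": 995})  # ok
    # write_document({"order_id": 123}) would raise before anything is written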


I think this is what meltano is trying to help with?

https://www.meltano.com/


Not really - meltano uses Singer (i.e. it extracts/loads data in JSON form) and dbt (for transformation, in the ELT pattern).

It’s a good tool (I use it), but the concerns GP is raising are very much its weaknesses.


On the other hand, it's convenient to use the same language as Jupyter notebooks.


Can you give a before/after of what the desired state would look like?


Might be worth looking into Airbyte!


Airbyte looks great, and the UI is fantastic.

It uses Java for its connectors, but it has issues importing a massive dataset into S3: there is a chunk limit of 10k and each chunk is 5 MB, which works out to a cap of roughly 50 GB per upload :).


We've used https://github.com/estuary/connectors/pkgs/container/source-... to load datasets in the many-terabyte range. Caveat: while it's implemented to Airbyte's spec, we've only used it with Flow.


That’s the current focus of the team. Consolidating those connectors :)


Something like Databricks?



