Thanks for sharing! Seems like this is a dbt-centric lineage tool that surfaces failed tests in the lineage itself?
Unlike a data observability platform like Monte Carlo which proactively monitors data, am I correct in assuming that your solution is less focused on data observability (i.e. monitoring production data and conducting root cause analysis / impact analysis) and more on ensuring reliable CI/CD?
I wouldn't personally draw such a bright line between monitoring and reliable CI/CD. That division definitely exists, but partly as a product of the complexity introduced by fragmented data systems. In some ways the ideal world is one where the need for extraordinarily complex monitoring tools is actually pretty limited, because we'd have tools to validate end-to-end data pipelines before making code changes, if that makes sense.
We actually already do data monitoring as well although we haven't built the specific alerting features of Monte Carlo. There are quite a few tools that do that really well so it's not our focus at the moment.
At Monte Carlo, we did some work on root cause analysis for data failures, like ETL job failures, timeouts, data delays, etc. I think there's a lot that can be done from a data science perspective to automate RCA, or provide better insights into data pipeline problems.
We put together this blog post, showing how an orchestration DAG (like a dbt schedule DAG) can be converted into a Bayesian network. You can then ask causal attribution questions in the form of conditional probability queries against the BN.
The idea is still pretty basic / preliminary, but I think it could be extended in all sorts of interesting ways e.g. attributing bad row-level data to upstream transformations, etc.
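To make that concrete, here's a minimal sketch of the kind of query I mean, using pgmpy. The DAG, model names, and probabilities are made up for illustration, not taken from the post; in a real system the CPDs would be estimated from historical run logs rather than hand-set.

```python
# Toy "did the upstream job cause the downstream failure?" query.
# Each node is a pipeline task with two states: 0 = ok, 1 = failed.
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# Edges mirror the orchestration DAG (hypothetical dbt-style model names).
model = BayesianNetwork([("raw_events", "stg_events"),
                         ("stg_events", "fct_sessions")])

# CPDs: columns are parent states (ok, failed); probabilities are invented.
model.add_cpds(
    TabularCPD("raw_events", 2, [[0.95], [0.05]]),
    TabularCPD("stg_events", 2, [[0.98, 0.30],
                                 [0.02, 0.70]],
               evidence=["raw_events"], evidence_card=[2]),
    TabularCPD("fct_sessions", 2, [[0.99, 0.20],
                                   [0.01, 0.80]],
               evidence=["stg_events"], evidence_card=[2]),
)
assert model.check_model()

# Causal attribution as a conditional probability query:
# P(raw_events failed | fct_sessions failed)
posterior = VariableElimination(model).query(
    ["raw_events"], evidence={"fct_sessions": 1})
print(posterior)
```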
Seems like the key pillars are: freshness, volume, schema, distribution, and lineage.
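For what it's worth, the first two pillars boil down to fairly simple checks; here's a toy sketch against SQLite where the table, column, and thresholds are all hypothetical (schema, distribution, and lineage need more machinery):

```python
# Toy freshness + volume checks against an in-memory table.
import sqlite3
from datetime import datetime, timedelta, timezone

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, loaded_at TEXT)")
now = datetime.now(timezone.utc)
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(i, now.isoformat()) for i in range(100)])

# Freshness: was the table loaded within the last hour?
(last_load,) = conn.execute("SELECT MAX(loaded_at) FROM events").fetchone()
fresh = datetime.fromisoformat(last_load) > now - timedelta(hours=1)

# Volume: is the row count inside an expected band?
(rows,) = conn.execute("SELECT COUNT(*) FROM events").fetchone()
volume_ok = 50 <= rows <= 10_000

print(f"fresh={fresh} volume_ok={volume_ok}")
```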
Makes more sense this way, I think...
If you think about metrics, traces, and logs (software observability pillars) as three distinct things, it's hard to view metadata as separate from metrics, lineage, or logs. Metadata is kind of the glue that holds everything together.
This article has more relevant sources, IMO, even if it is from a SaaS vendor.