pg_duckdb: Splicing Duck and Elephant DNA (motherduck.com)
186 points by jonbaer on Aug 17, 2024 | 30 comments


If/as we move to analytic workloads, it would be awesome to see Postgres pick up support for AsOf joins, time_bucket, etc. that DuckDB and Timescale have.

I don't enjoy SQL and never have, and I much prefer the ergonomics of time_bucket to date_bin.

For example, I would do this in duckdb:

  SELECT
    count(*) as y
    , time_bucket(interval '2 weeks', at::timestamp) as x
  FROM analytics
  WHERE some_bool AND some_haystack = 'needle'
  GROUP BY x
  ORDER BY x
In Postgres it looks more like:

  WITH counts AS (
    SELECT
      date_bin('1 hour'::interval, at, (now() - interval '2 weeks')::timestamp),
      count(*) AS c
    FROM analytics a
    WHERE some_bool AND some_haystack = 'needle'
    GROUP BY date_bin
  )
  SELECT series AS x, coalesce(counts.c, 0) AS y
  FROM generate_series(
    (now() - interval '2 weeks')::timestamp,
    now()::timestamp,
    interval '1 hour'
  ) series
  LEFT JOIN counts
    ON counts.date_bin = series;
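
For reference, with the TimescaleDB extension installed the Postgres version can already take roughly the same shape as the DuckDB one. This is just a sketch assuming the same analytics table; gap-filling would still need time_bucket_gapfill or the generate_series join above:

  -- Sketch using TimescaleDB's time_bucket (assumes the extension is installed);
  -- same shape as the DuckDB query, without the gap-filling join.
  SELECT
    count(*) AS y,
    time_bucket(interval '1 hour', at) AS x
  FROM analytics
  WHERE some_bool AND some_haystack = 'needle'
    AND at > now() - interval '2 weeks'
  GROUP BY x
  ORDER BY x;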


I think it will be straightforward to expose time_bucket in pg_duckdb. Feel free to open an issue for the feature.


ParadeDB has an open PR for time_bucket support via DuckDB here: https://github.com/paradedb/pg_analytics/pull/32


Very much agreed with this general idea, and I believe a lot of this was inspired by the team we hired at Crunchy Data to build it, as they were socializing it for a while. Looking forward to pg_duckdb advancing in time; for now it still seems pretty early and has some maturing to do. As others have said, it needs to be a bit more stable and production grade. But the opportunity is very much there.

We recently submitted our Crunchy Bridge for Analytics benchmark (at the broadest level, based on the same idea) to ClickBench by ClickHouse (https://benchmark.clickhouse.com/), which puts us at #6 overall amongst managed service providers and gives a real, viable option for Postgres as an analytics database (at least per ClickBench). Also of note, there are a number of other Postgres variations such as ParadeDB that are definitely not 1000x slower than ClickHouse or DuckDB.


Hey Craig, for the public record: pg_duckdb was not inspired by the team at Crunchy Data. Our early MVP version, "pg_quack", was made public (Apache 2.0) on February 2nd. About two months later, Crunchy's analytics product shipped on April 30th. If you were working on it around a similar time, it was a coincidence. Let's call it great minds think alike.


Craig fan here, agree it's zeitgeist and I'm loving the PG ecosystem


I just did a project for a YC startup and we reverted to Postgres from DuckDB + SQLite over concerns that enterprises might not see the local-file combo as mature/professional.

Really excited about the idea of being able to have everything under the Postgres umbrella, even with sacrifices.

From the engineering side I have nothing but good things to say about duckdb.

I opened up the database to the frontend (it's an internal reporting tool not unlike Grafana, and I filtered queries through an allowlist) and it was pure delight to have the metrics queries right next to the graph. Very rapid iterations.


As Craig said, Crunchy has a very mature, enterprise-grade offering for analytics in Postgres and is very much leading the charge here. ParadeDB is built in a similar way, also ranking high on ClickBench, and is available in the open source as well.

I'm hopeful the pg_duckdb project will mature enough to be a stable foundation for ParadeDB and others, but that appears to depend on MotherDuck and how much they're willing to push this forward.


ParadeDB chose GPL, so I could see pg_duckdb accelerating past them. But then you never know; each of them can change the license at any time.


ParadeDB itself is AGPL, yes. Our core offering is pg_search, which offers Elasticsearch inside Postgres. What we build will be AGPL, and if pg_duckdb moves forward we will be happy to rebuild our analytics offering on top of it.


Hey Phil, the blog post says pg_duckdb is being taken forward by DuckDB Labs, Hydra, MotherDuck, Neon, and Microsoft Azure. We're fully invested in developing pg_duckdb and I'm happy to work collaboratively: do you have something valuable to add to pg_duckdb?


There is a lot missing from it, as you know. We'd be happy to be part of the project if we get commit access or even a partnership :)


Are you guys planning to open-source your work at Crunchy?


Would be helpful to list the features. This link has the details:

https://github.com/duckdb/pg_duckdb

Sounds like it would be useful for Postgres users to interact with Parquet and CSV data within a single SQL query and in a performant way (due to DuckDB's vectorization).
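
As a rough illustration of the kind of query that enables (written in plain DuckDB syntax; the table and file names are made up), joining a regular table against a CSV file in one statement:

  -- Plain DuckDB syntax; pg_duckdb's goal is to run this kind of query from
  -- inside Postgres. Table and file names here are made up for illustration.
  SELECT u.name, sum(e.amount) AS total
  FROM users u
  JOIN read_csv_auto('exports/events.csv') e ON e.user_id = u.id
  GROUP BY u.name
  ORDER BY total DESC;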


Postgres IS missing an analytics engine. benchmark.clickhouse.com puts it at the bottom of the list and ~1000x slower than @duckdb and @ClickHouseDB.

Here are the scenarios and how to address them:

1. Query Parquet and Iceberg from Postgres. When Parquet files are stored in S3, Postgres should be able to run analytical queries on them.

2. Postgres should allow creation of columnstore tables inside the Postgres storage subsystem. Analytical queries on top of these tables should be FAST. Top-10-on-ClickBench fast. This allows running analytics without S3 and getting super low latencies for analytics.

3. Postgres should allow creation of secondary columnstore indexes to speed up analytical queries in mixed workloads. This is super useful for Oracle migrations since Oracle has had this feature for a while. (A rough syntax sketch for 2 and 3 follows below.)
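
To make scenarios 2 and 3 concrete, here is roughly what the SQL surface could look like. None of this syntax exists today; it is purely a sketch of the desired end state:

  -- Hypothetical syntax, sketching scenarios 2 and 3 (none of this exists today).

  -- Scenario 2: a native columnstore table, stored and WAL-logged by Postgres.
  CREATE TABLE events_analytics (
    user_id    bigint,
    event_type text,
    at         timestamptz
  ) USING columnstore;

  -- Scenario 3: a secondary columnstore index on an ordinary heap table, so
  -- analytical scans can use it while OLTP writes keep hitting the heap.
  CREATE INDEX events_by_column ON events USING columnstore (user_id, event_type, at);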

So how do we get there? Ten years ago it would be a MASSIVE project, but today we have @duckdb, a super fast analytical engine with an open license. The work is still not trivial, but it is much, much simpler.

First you need to integrate an analytical query processor into Postgres, and today @duckdblabs announced github.com/duckdb/pg_duck…. Yay and congrats!

This plugin runs DuckDB alongside Postgres and integrates Postgres syntax with the @duckdb query processor (QP).

With that it now can trivially query external files from S3. This addresses scenario 1.
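
Roughly, the query shape for scenario 1 looks like this (DuckDB syntax; the bucket path is a placeholder), with pg_duckdb aiming to expose the same kind of table function from inside Postgres:

  -- Scenario 1 in practice (DuckDB syntax; the bucket path is a placeholder).
  SELECT date_trunc('day', at) AS day, count(*) AS events
  FROM read_parquet('s3://my-bucket/analytics/*.parquet')
  GROUP BY day
  ORDER BY day;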

Building columnar tables requires either implementing columnar storage from scratch or integrating DuckDB storage into the Postgres storage subsystem. You can of course let DuckDB create DuckDB files on local disk, but then all the Postgres machinery (replication, backup, recovery) won't work.

DuckDB tables have to be mapped into 8kB Postgres pages and pushed through the Postgres WAL for replication, recovery, and transactionality. This will give us scenario 2.

Scenario 3 is even more work. You need secondary index maintenance, and it will require hybrid query execution. We will need to modify the Postgres executor so that it can mix and match regular Postgres query operators and "vectorized" query operators from DuckDB, or build vectorized operators into Postgres.

Scenarios 2 and 3 will take some time, but I'm excited for this roadmap: this will unlock a huge world for millions of Postgres users and simplify the lives of many developers dealing with moving data between transactional and analytical systems.


re: columnar - https://github.com/hydradatabase/hydra/tree/main/columnar

a much-updated fork of Citus' columnar using the Postgres table access method (tableam) API.
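
For context, from the user's side the tableam-based approach looks like this (the table definition here is made up; 'columnar' is the access method name Citus and Hydra use):

  -- Columnar storage via a Postgres table access method (Citus/Hydra columnar).
  CREATE TABLE page_views (
    site_id int,
    url     text,
    at      timestamptz
  ) USING columnar;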

re: duckdb handling storage - https://github.com/hydradatabase/pg_quack/tree/branch-0.0.1

an earlier implementation


Hey Jerry, it's nice to see you here :) Truth is, the work you guys did at Hydra with the fork of Citus columnar is the more Postgres-idiomatic analytics engine in Postgres. It's a shame it was abandoned.


> It's a shame it was abandoned.

I'm not sure that saying it's abandoned is quite accurate. Looking at it from a (now outside) perspective, I see it as more of a hard engineering project.

I still use columnar: it's very mature at this point, but any further (large) improvements would require much larger engineering efforts.


It seems like the team behind it is now working on pg_duckdb, though, which likely will supersede Citus columnar eventually? Is that not the goal with it?


This is a cool roadmap. I'm excited for Neon to get in the mix here; it's going to be a lot of work. My concern is... there are very few companies who care about doing the work in such a Postgres-idiomatic way. I suspect Neon will have to be one of the main backers of such a large project for it to pick up.


I would be all about hive partitioning for postgresified tables, too.

https://duckdb.org/docs/data/partitioning/hive_partitioning....
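
For reference, this is what that looks like on the DuckDB side today (bucket and paths are placeholders); the year/month directory keys become regular columns you can filter on:

  -- DuckDB reading a hive-partitioned layout (bucket/paths are placeholders);
  -- partition keys like year and month become queryable columns.
  SELECT count(*)
  FROM read_parquet('s3://my-bucket/events/year=*/month=*/*.parquet',
                    hive_partitioning = true)
  WHERE year = 2024;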


Excited about the work here. However, my 2 cents: for this to become a reality (serious production use at scale), I don't think it's just about the choice of the analytical engine (here, DuckDB), but rather about how well the Postgres extension is built. The Postgres extension framework is complex, still maturing, and doesn't offer full flexibility to implement features. We saw this closely at Citus. It was a deterrent to competing with native analytical databases like ClickHouse and Snowflake. A bunch of customers, including CloudFlare and Heap, switched from Citus to ClickHouse and SingleStore, respectively. This was one of the inspirations to start PeerDB, to make it magical for customers to move data from Postgres to native, purpose-built analytical databases like ClickHouse.

As a Postgres fan, good luck and best wishes with the effort here!


How are we doing on SQL syntax compatibility across these extensions [pg_duckdb, pg_analytics (ParadeDB), Crunchy Analytics]? (Or maybe standardizing on https://substrait.io/.)

It would be great if one could have a diversity of Postgres instances in a data mesh but execute the exact same SQL on all of them.


ParadeDB here -- agreed. The syntax for ParadeDB and Crunchy Analytics is similar, since we're both built as Postgres foreign data wrappers. pg_duckdb is the odd one out here, being built as a custom scan. I am confident they will converge in the future, though :)


Looking forward to this getting supported on Neon!


Very soon; it needs to get a bit more stable first.


Does the pg_duckdb extension double-write data to DuckDB? I skimmed the blog post and didn't understand if the extension adds some capabilities to Postgres, or just makes it easier to maintain a consistent data model across two database engines.


They had to use duck DNA and not pig DNA because, as everyone already knows, pig and elephant DNA just don’t splice.



This looks awesome, I can't wait to play with it! We need to get this into RDS as soon as we can.



