Curious about your post, and DuckDB, since I have seen many previous posts here on HN about it... I just took 10 minutes to do some quick tests and experiment with DuckDB for the first time. :-)
While I am a Linux user, I tried this on an available Windows 10 machine (I know... yuck!)
1) Setup and install, no problem. Tracked the duckdb process memory usage with PowerShell while running the queries.
3) Did 8 different queries using DISTINCT ON / GROUP BY.
The one below is one example (a rough GROUP BY equivalent is sketched right after it):
```
SELECT DISTINCT ON (Title)
    Title,
    "As of",
    Rank
FROM netflix
ORDER BY Title, "As of" DESC;
```
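The GROUP BY counterpart (a sketch using DuckDB's arg_max aggregate to pull the Rank belonging to the latest "As of" date; equivalent up to ties):

```
SELECT
    Title,
    max("As of")           AS "As of",
    arg_max(Rank, "As of") AS Rank
FROM netflix
GROUP BY Title;
```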
I am not seeing any out-of-memory errors or memory leaks from these quick tests.
I also tested with the parquet file from the thread mentioned below by mgt19937. It has 89,475 rows. Did some complex DISTINCT ON / GROUP BY queries on it, without seeing either explosive memory use or anything resembling a memory leak.
For analytic queries, that is nowhere near what I've seen at multiple companies that have opted for columnar storage. That would be at most a few seconds of incoming data.
With so few rows, I would not be surprised if you could use standard command line tools to get the same results from a text file of similar size in an acceptable time.
```
COPY (
    SELECT DISTINCT ON (b, c) *
    FROM READ_PARQUET('input.parquet')
    ORDER BY a DESC
)
TO 'output.parquet' (FORMAT PARQUET, COMPRESSION 'ZSTD');
```
Where the input file has 25M rows (500 MB in Parquet format) and 4 columns: a and b are BIGINTs, c and d are VARCHARs.
On a MacBook Pro M1 with 8 GB of RAM (16x the original file size), the query will not finish.
This is a query that could very easily be optimised to use very little memory: hash the DISTINCT ON key and replace the already-seen row in place whenever its "a" value is larger than the one already stored.
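That strategy is essentially a hash aggregation, and the same result can be written directly as a GROUP BY (a sketch using the column names from the example above and DuckDB's arg_max aggregate; equivalent up to ties, and not a claim about how DuckDB plans the original query):

```
COPY (
    SELECT
        max(a)        AS a,  -- largest a per (b, c) group
        b,
        c,
        arg_max(d, a) AS d   -- d taken from the row with that largest a
    FROM READ_PARQUET('input.parquet')
    GROUP BY b, c
) TO 'output.parquet' (FORMAT PARQUET, COMPRESSION 'ZSTD');
```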
With the information in your example... I created a Parquet file with 50 million rows of random data, using the same data types. The Parquet file is 391 MB on disk (NTFS).
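For anyone who wants to reproduce this, a file of roughly that shape can be generated with DuckDB itself (a sketch; the value ranges and cardinalities are my assumptions, not the exact data used here):

```
COPY (
    SELECT
        (random() * 1e9)::BIGINT AS a,  -- BIGINT
        (random() * 1e6)::BIGINT AS b,  -- BIGINT
        md5(random()::VARCHAR)   AS c,  -- VARCHAR
        md5(random()::VARCHAR)   AS d   -- VARCHAR
    FROM range(50000000)
) TO 'input.parquet' (FORMAT PARQUET);
```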
The query will complete, but it takes approximately 3.5 to 4 minutes and needs up to 14 GB of memory (4 cores, Win10, 32 GB RAM).
You can see below the memory usage in MB throughout the query, sampled at 15-second intervals.
Indeed, 14 GB seems really high for a 400 MB Parquet file; that's a 35x multiple of the base file size.
Of course, the data is compressed on disk, but even the uncompressed data isn't that large, so I do believe quite a lot of optimisations are still possible.
It’s also the aggregation operation. If there are many unique groups it can take a lot of memory.
Newer DuckDB versions are able to handle out-of-core operations better. But in general, just because the data fits in memory doesn't mean the operation will; and as I said, 8 GB is very limited memory, so it will entail spilling to disk.
What is the error message you’re getting? Or is it simply that the query will not finish? (Does it continue to run?)
Parquet files are compressed, and many analytic operations require more memory than the on-disk size. When you don't have enough RAM, DuckDB has to switch to out-of-core mode, which is slower (the classic performance trade-off).
8 GB of RAM is usually not enough to expect performance from analytic operations; I usually have a minimum of 16. My current instance is remote and has 256 GB of RAM; I never run out of RAM, DuckDB never fails, and it runs super fast.
In general DuckDB is great but I had similar out-of-memory issues specifically when using "distinct on" and "unnest". In such cases I usually delegate some queries to chdb / clickhouse-local instead.
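For reference, the same kind of dedup in clickhouse-local could look something like this (a sketch, assuming the column names from the Parquet example upthread; LIMIT 1 BY keeps one row per key in the given order):

```
-- invoked roughly as: clickhouse-local --query "<the query below>"
SELECT *
FROM file('input.parquet', Parquet)
ORDER BY a DESC
LIMIT 1 BY b, c
INTO OUTFILE 'output.parquet'
FORMAT Parquet;
```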
DuckDB has been evolving nicely; I especially like their SQL dialect (things like list(column1 ORDER BY column2), COLUMNS with a regex, * REPLACE, etc.) and trust they'll eventually resolve the memory issues (hopefully).
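For readers who haven't seen those dialect features, they look roughly like this (a sketch over a hypothetical table t with made-up column names):

```
-- aggregate with an ORDER BY inside it
SELECT list(column1 ORDER BY column2) FROM t;

-- apply an expression to every column whose name matches a regex
SELECT max(COLUMNS('val_.*')) FROM t;

-- star expression with one column rewritten in place
SELECT * REPLACE (round(price, 2) AS price) FROM t;
```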
Thanks for repeating yourself; this comment potentially influences my decision-making about DuckDB. Good to know about the negatives too.
At what data volumes does it start erroring out? Are these volumes larger than RAM? Is there a minimal example to reproduce it? Is this ticket related to your issue? https://github.com/duckdb/duckdb/issues/12480
In my experience, recent versions work well for out-of-core computation (I use it frequently). What is the size of the Parquet file that you are using? I can't find the details in your previous comments.
Sounds like the very early versions of EF Core which did not translate C# GroupBy into SQL GROUP BY, but instead loaded the /entire/ data set into memory and did the aggregate there.
I've had similar experiences with DuckDB; it feels nicer to use on the surface, but in terms of performance and actual functionality I've had a better experience with clickhouse-local.
Are you using it for simple SQL retrieval or complex analytic queries? They’re both similar for the former use case, but DuckDB — being an analytic engine — supports the latter use case much better.
DuckDB will error out with an out-of-memory exception on very simple DISTINCT ON / GROUP BY queries.
Even with a temporary file, an on-disk database, and insertion-order preservation disabled.
On any version of DuckDB.
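For concreteness, the settings being referred to look like this (these options do exist in DuckDB; whether they help for this workload is exactly what is being reported here):

```
-- run against an on-disk database, e.g. started with: duckdb mydb.duckdb
SET memory_limit = '8GB';              -- cap on DuckDB's memory usage
SET temp_directory = '/tmp/duck_tmp';  -- where larger-than-memory operators spill
SET preserve_insertion_order = false;  -- "not keeping the initial order"
```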