Curious about your post, and DuckDB, since I have seen many previous posts here on HN about it... I just took 10 minutes to do some quick tests and experiment with DuckDB for the first time. :-)
While I am a Linux user, I tried this on an available Windows 10 machine (I know... yuck!)
1) Setup and install, no problem. Tracked the duckdb process memory usage with PowerShell while running the queries.
3) Did 8 different queries using DISTINCT ON / GROUP BY.
The one below is one example (a rough GROUP BY equivalent is sketched right after it):
```
SELECT DISTINCT ON (Title)
    Title,
    "As of",
    Rank
FROM netflix
ORDER BY Title, "As of" DESC;
```
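The GROUP BY counterpart (a sketch using DuckDB's arg_max aggregate to pull the Rank belonging to the latest "As of" date; equivalent up to ties):

```
SELECT
    Title,
    max("As of")           AS "As of",
    arg_max(Rank, "As of") AS Rank
FROM netflix
GROUP BY Title;
```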
I am not seeing any out-of-memory errors or memory leaks from these quick tests.
I also tested with the parquet file from the thread mentioned below by mgt19937. It has 89,475 rows. Did some complex DISTINCT ON / GROUP BY queries on it, without seeing either explosive memory use or anything resembling a memory leak.
For analytic queries, that is nowhere near what I've seen at multiple companies that have opted for columnar storage. That would be at most a few seconds of incoming data.
With so few rows, I would not be surprised if you could use standard command line tools to get the same results from a text file of similar size in an acceptable time.
```
COPY (
    SELECT DISTINCT ON (b, c) *
    FROM READ_PARQUET('input.parquet')
    ORDER BY a DESC
)
TO 'output.parquet' (FORMAT PARQUET, COMPRESSION 'ZSTD');
```
Where the input file has 25M rows (500 MB in Parquet format) and 4 columns: a and b are BIGINTs, c and d are VARCHARs.
On a MacBook Pro M1 with 8 GB of RAM (16x the original file size), the query will not finish.
This is a query that could very easily be optimised to use very little memory: hash the DISTINCT ON key and replace the already-seen row in place whenever its "a" value is larger than the one already stored.
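That strategy is essentially a hash aggregation, and the same result can be written directly as a GROUP BY (a sketch using the column names from the example above and DuckDB's arg_max aggregate; equivalent up to ties, and not a claim about how DuckDB plans the original query):

```
COPY (
    SELECT
        max(a)        AS a,  -- largest a per (b, c) group
        b,
        c,
        arg_max(d, a) AS d   -- d taken from the row with that largest a
    FROM READ_PARQUET('input.parquet')
    GROUP BY b, c
) TO 'output.parquet' (FORMAT PARQUET, COMPRESSION 'ZSTD');
```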
With the information in your example... I created a Parquet file with 50 million rows of random data, using the same data types. The Parquet file is 391 MB on disk (NTFS).
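For anyone who wants to reproduce this, a file of roughly that shape can be generated with DuckDB itself (a sketch; the value ranges and cardinalities are my assumptions, not the exact data used here):

```
COPY (
    SELECT
        (random() * 1e9)::BIGINT AS a,  -- BIGINT
        (random() * 1e6)::BIGINT AS b,  -- BIGINT
        md5(random()::VARCHAR)   AS c,  -- VARCHAR
        md5(random()::VARCHAR)   AS d   -- VARCHAR
    FROM range(50000000)
) TO 'input.parquet' (FORMAT PARQUET);
```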
The query will complete, but it takes approximately 3.5 to 4 minutes and needs up to 14 GB of memory (4 cores, Win10, 32 GB RAM).
You can see below the memory usage in MB throughout the query, sampled at 15-second intervals.
Indeed, 14 GB seems really high for a 400 MB Parquet file; that's a 35x multiple of the base file size.
Of course, the data is compressed on disk, but even the uncompressed data isn't that large, so I do believe quite a lot of optimisations are still possible.
It’s also the aggregation operation. If there are many unique groups it can take a lot of memory.
Newer DuckDB versions are able to handle out-of-core operations better. But in general, just because the data fits in memory doesn't mean the operation will; and as I said, 8 GB is very limited memory, so it will entail spilling to disk.
What is the error message you’re getting? Or is it simply that the query will not finish? (Does it continue to run?)
Parquet files are compressed, and many analytic operations require more memory than the on-disk size. When you don't have enough RAM, DuckDB has to switch to out-of-core mode, which is slower (the classic performance trade-off).
8 GB of RAM is usually not enough to expect performance from analytic operations; I usually have a minimum of 16. My current instance is remote and has 256 GB of RAM; I never run out of RAM, DuckDB never fails, and it runs super fast.
In general DuckDB is great but I had similar out-of-memory issues specifically when using "distinct on" and "unnest". In such cases I usually delegate some queries to chdb / clickhouse-local instead.
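For reference, the same kind of dedup in clickhouse-local could look something like this (a sketch, assuming the column names from the Parquet example upthread; LIMIT 1 BY keeps one row per key in the given order):

```
-- invoked roughly as: clickhouse-local --query "<the query below>"
SELECT *
FROM file('input.parquet', Parquet)
ORDER BY a DESC
LIMIT 1 BY b, c
INTO OUTFILE 'output.parquet'
FORMAT Parquet;
```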
DuckDB has been evolving nicely; I especially like their SQL dialect (things like list(column1 ORDER BY column2), COLUMNS with a regex, * REPLACE, etc.) and trust they'll eventually resolve the memory issues (hopefully).
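For readers who haven't seen those dialect features, they look roughly like this (a sketch over a hypothetical table t with made-up column names):

```
-- aggregate with an ORDER BY inside it
SELECT list(column1 ORDER BY column2) FROM t;

-- apply an expression to every column whose name matches a regex
SELECT max(COLUMNS('val_.*')) FROM t;

-- star expression with one column rewritten in place
SELECT * REPLACE (round(price, 2) AS price) FROM t;
```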
Thanks for repeating yourself; this comment potentially influences my decision-making about DuckDB. Good to know about the negatives too.
At what data volumes does it start erroring out? Are these volumes larger than RAM? Is there a minimal example to reproduce it? Is this ticket related to your issue? https://github.com/duckdb/duckdb/issues/12480
In my experience, recent versions work well for out-of-core computation (I use it frequently). What is the size of the Parquet file that you are using? I can't find the details in your previous comments.
Sounds like the very early versions of EF Core which did not translate C# GroupBy into SQL GROUP BY, but instead loaded the /entire/ data set into memory and did the aggregate there.
I've had similar experiences with DuckDB; it feels nicer to use on the surface, but in terms of performance and actual functionality I've had a better experience with clickhouse-local.
Are you using it for simple SQL retrieval or complex analytic queries? They’re both similar for the former use case, but DuckDB — being an analytic engine — supports the latter use case much better.
DuckDB will error out with an out-of-memory exception on very simple DISTINCT ON / GROUP BY queries.
Even with a temporary file, an on-disk database, and insertion-order preservation disabled.
On any version of DuckDB.
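For concreteness, the settings being referred to look like this (these options do exist in DuckDB; whether they help for this workload is exactly what is being reported here):

```
-- run against an on-disk database, e.g. started with: duckdb mydb.duckdb
SET memory_limit = '8GB';              -- cap on DuckDB's memory usage
SET temp_directory = '/tmp/duck_tmp';  -- where larger-than-memory operators spill
SET preserve_insertion_order = false;  -- "not keeping the initial order"
```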