There are other databases today that do real time analytics (ClickHouse, Apache ...

slotrans · on Aug 16, 2023

Yeah ClickHouse is definitely the way to go here. Its ability to serve queries with low latency and high concurrency is in an entirely different league from Snowflake, Redshift, BigQuery, etc.

biggestdummy · on Aug 16, 2023

StarRocks handles latency and concurrency as well as Clickhouse but also does joins. Less denormalization, and you can use the same platform for traditional BI/ad-hoc queries.

slotrans · on Aug 17, 2023

I wasn't familiar with StarRocks so thanks for calling attention to it.

It appears to make very different tradeoffs in a number of areas so that makes it a potential useful alternative. In particular transactional DML will make it much more convenient for workloads involving mutation. Plus as you suggested, having a proper Cost-Based Optimizer should make joins more efficient (I find ClickHouse joins to be fine for typical OLAP patterns but they do break down in creative queries...)

It's a bummer though that the deployment model is so complicated. One thing I truly like about ClickHouse is its ability to wring every drop of performance out of a single machine, with a super simple operational model. Being able to scale to a cluster is great but having to start there is Not Great.

riku_iki · on Aug 16, 2023

Clickhouse also does joins.

Somehow StarRocks dudes appear in every relevant post with this false claim.

biggestdummy · on Aug 16, 2023

There's a difference between "supports the syntax for joins" and "does joins efficiently enough that they are useful."

My experience with Clickhouse is that its joins are not performant enough to be useful. So the best practice in most cases is to denormalize. I should have been more specific in my earlier comment.

riku_iki · on Aug 16, 2023

ack that anonymous user in internet said he couldn't make CLickhouse joins perform well in his case which he didn't describe

biggestdummy · on Aug 18, 2023

Not "that anonymous user." In my experience, avoiding Join statements is a common best practice for Clickhouse users seeking performant queries on large datasets. A couple examples... https://medium.com/datadenys/optimizing-star-schema-queries-... https://posthog.com/blog/secrets-of-posthog-query-performanc...

riku_iki · on Aug 18, 2023

> avoiding Join statements is a common best practice for Clickhouse users seeking performant queries on large datasets

its common best practice on any database, because if both joined tables don't fit memory, then merge join is O(nlogn) operation which indeed many times slower than querying denormalized schema, which will have linear execution time.

RyanHamilton · on Aug 18, 2023

For real-time and large historical data, open source there's tdengine/questdb, commercial DolphinDB and kdb+. If you only need fast recent data and not large historical embedding is a good solution which means h2/duckdb/sqlite if open source, extremedb if commercial. I've benchmarked and ran applications on most these databases including running real-time analytics.

qoega · on Aug 18, 2023

Open-source ClickHouse also allows both real-time and large historical data.

nhourcard · on Aug 18, 2023

QuestDB, kdb+ and others mentioned are more geared toward time-series workloads, while Clickhouse is more toward OLAP. There are also exciting solutions on the streaming side of things with RisingWave etc.