
All streaming processors face the same fundamental problem:

Streaming joins require maintaining state for both sides of the join

High-cardinality data (millions of unique keys) means huge state sizes

Traditional approach: keep everything in memory, which quickly exhausts memory at this scale

The high-cardinality join memory problem isn't unique to Timeplus. Apache Flink also uses hybrid hash joins that spill to disk (RocksDB) when memory fills, Materialize shares indexed state across multiple queries (but still requires keeping full datasets in memory), and RisingWave stores state in cloud object storage (S3/GCS) with LRU caching for hot data. What makes Timeplus different is its purpose-built optimization for the Pareto Principle, where a tiny fraction of data generates the vast majority of activity - keeping hot data in memory and cold data on disk for dramatic memory savings.
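To make the state problem concrete, here is a hedged sketch in generic streaming SQL (the streams and columns are made up) of the kind of stream-to-stream join that causes it: the engine has to buffer recent rows from both sides keyed by user_id, so state grows with the number of distinct keys seen in the join window.

  -- Generic streaming-SQL sketch (hypothetical streams/columns):
  -- correlate clicks with purchases made within 10 minutes of the click.
  -- The engine must retain recent rows from BOTH streams keyed by user_id,
  -- so state grows with the number of distinct user_ids in the window.
  SELECT c.user_id, c.click_ts, p.purchase_ts, p.amount
  FROM clicks AS c
  JOIN purchases AS p
    ON c.user_id = p.user_id
   AND p.purchase_ts BETWEEN c.click_ts AND c.click_ts + INTERVAL 10 MINUTE;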


Redpanda + Timeplus: the perfect pair for data streaming developers. No JVM, no ZooKeeper ...


Probably the smallest yet most powerful binary for real-time, incremental SQL data processing, end to end!


A single binary is nice, but how does it scale?


Here is an overview of the Timeplus cluster: https://docs.timeplus.com/cluster#overview

Basically, all cluster nodes are deployed with the same binary, plus some extra configuration.


Is there a native SQL pipeline tool for ClickHouse that processes real-time data incrementally, with low latency, high throughput, and high efficiency, similar to Snowflake’s Dynamic Tables?

[1] Dynamic Tables: One of Snowflake’s Fastest-Adopted Features: https://www.snowflake.com/en/blog/reimagine-batch-streaming-...
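For reference, a hedged sketch of what a Dynamic Table declaration looks like (the warehouse, table, and column names below are placeholders): Snowflake keeps the result refreshed, incrementally where it can, within the declared target lag.

  -- Snowflake Dynamic Table sketch (placeholder names):
  -- the result is kept fresh within TARGET_LAG using the given warehouse.
  CREATE DYNAMIC TABLE enriched_orders
    TARGET_LAG = '1 minute'
    WAREHOUSE  = transform_wh
  AS
  SELECT o.order_id, o.amount, c.country
  FROM orders AS o
  JOIN customers AS c ON o.customer_id = c.customer_id;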


Dynamic Tables are interesting for declarative streaming. In the ClickHouse ecosystem, you might want to look at materialized views combined with streaming engines.

For real-time transformations, there are a few approaches:

- Native ClickHouse materialized views with AggregatingMergeTree

- Stream processors that write to ClickHouse (Flink, Spark Streaming)

- Streaming SQL engines that can read/write ClickHouse

We've been working on streaming SQL at Proton (github.com/timeplus-io/proton) which handles similar use cases - continuous queries that maintain state and can write results back to ClickHouse. The key difference from Dynamic Tables is handling unbounded streams vs micro-batches.
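For the write-results-back-to-ClickHouse part, here's a rough sketch in Proton-style SQL. The table names, address, and settings are placeholders and the exact DDL may differ between versions, so treat it as a shape rather than copy-paste.

  -- Sink: an external table pointing at an existing ClickHouse table
  -- (address/database/table values are placeholders).
  CREATE EXTERNAL TABLE ch_minute_counts
  SETTINGS type = 'clickhouse', address = 'clickhouse:9000',
           database = 'analytics', table = 'minute_counts';

  -- Continuous query: a long-running materialized view that maintains
  -- per-minute aggregates over the unbounded stream and keeps writing
  -- incremental results into ClickHouse.
  CREATE MATERIALIZED VIEW minute_counts_mv INTO ch_minute_counts AS
  SELECT window_start, action, count() AS events
  FROM tumble(events_stream, 1m)
  GROUP BY window_start, action;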

What's your specific use case? Happy to discuss the tradeoffs.


Data sources are usually in Kafka, or in operational databases like Postgres or MySQL.

1. Table A: fact events, high throughput (10k~1M eps), high cardinality

2. Tables B, C, D: a couple of dimension tables (fast- or slow-changing).

The use case is straightforward: join/enrich/look up everything into one big flattened, analytics-friendly table in ClickHouse.

What’s the best pipeline approach to achieve this efficiently and in real time?
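For what it's worth, here's a hedged sketch of one way to wire this up in Proton/Timeplus-style streaming SQL. Every name, broker address, and setting below is a placeholder, and the exact DDL may differ by version; it's meant to show the shape of the pipeline, not a definitive implementation.

  -- Fact stream read from Kafka (placeholder brokers/topic/schema).
  CREATE EXTERNAL STREAM fact_events (
      event_ts datetime64(3),
      user_id  uint64,
      item_id  uint64,
      amount   float64
  )
  SETTINGS type = 'kafka', brokers = 'kafka:9092', topic = 'events',
           data_format = 'JSONEachRow';

  -- One dimension table as a key-value stream that keeps the latest row
  -- per key (fed by CDC from Postgres/MySQL, e.g. via Debezium).
  -- Repeat the same pattern for the other dimension tables.
  CREATE STREAM dim_users (
      user_id uint64,
      country string,
      tier    string
  )
  PRIMARY KEY (user_id)
  SETTINGS mode = 'versioned_kv';

  -- ClickHouse sink for the flattened, analytics-friendly table.
  CREATE EXTERNAL TABLE ch_enriched
  SETTINGS type = 'clickhouse', address = 'clickhouse:9000',
           database = 'analytics', table = 'enriched_events';

  -- Continuous enrichment: join the fact stream against the latest
  -- dimension values and stream the results into ClickHouse.
  CREATE MATERIALIZED VIEW enrich_mv INTO ch_enriched AS
  SELECT e.event_ts, e.user_id, e.item_id, e.amount, u.country, u.tier
  FROM fact_events AS e
  JOIN dim_users AS u ON e.user_id = u.user_id;

The same shape works with Flink SQL or other streaming engines; the main tradeoff is how much join state has to stay hot in memory given the cardinality of the fact stream.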


We've consistently heard that ClickHouse materialized views are quite limited and can't handle real-time pipelines fast or efficiently enough. Would love to see more comments here.


There are some limitations as far as I know:

1. Insert Performance Degradation

Users frequently complain that materialized views significantly slow down insert performance, especially when there are multiple MVs on a single table.

2. Streaming Data Patterns

This is critical for ClickHouse materialized views. Streaming data often arrives in frequent, small batches, but ClickHouse performs best when ingesting data in larger batches. The materialized view's transformation query runs synchronously as part of the INSERT for every single batch, making the fixed overhead disproportionately large for small batches.

3. Block-Level Processing Limitations

MVs in ClickHouse operate only on the data blocks being inserted at that moment. When performing aggregation, a single group from the original dataset may end up with multiple entries in the target table, since grouping is applied only to the current insert block (see the sketch after this list).

4. JOIN Limitations and Memory Issues

Materialized views with JOINs are problematic because MVs only trigger on inserts into the left-most table. Updating the view when the right-hand join table changes is also inefficient: either the hash table has to be rebuilt on every insert, or a large hash table is kept resident and consumes a lot of memory.

5. Reprocessing historical data requires manual backfill operations (e.g., INSERT ... SELECT into the target table).

6. Each materialized view creates a new part from the block over which it runs, potentially causing the "Too Many Parts" issue.
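To make point 3 concrete, here's the usual workaround pattern in plain ClickHouse SQL (table and column names are made up): the MV emits partial aggregate states per insert block into an AggregatingMergeTree table, and readers have to finish the aggregation with the -Merge combinator.

  -- Target table stores partial aggregate states, not finished numbers.
  CREATE TABLE daily_uniques
  (
      day        Date,
      uniq_users AggregateFunction(uniq, UInt64)
  )
  ENGINE = AggregatingMergeTree
  ORDER BY day;

  -- The MV runs only over each inserted block, so one day can end up
  -- with many partial-state rows spread across blocks and parts.
  CREATE MATERIALIZED VIEW daily_uniques_mv TO daily_uniques AS
  SELECT
      toDate(event_time) AS day,
      uniqState(user_id) AS uniq_users
  FROM events
  GROUP BY day;

  -- Readers must merge the partial states themselves.
  SELECT day, uniqMerge(uniq_users) AS uniq_users
  FROM daily_uniques
  GROUP BY day;

This works, but it pushes the merge cost to query time (or to background merges), which is part of why people reach for a separate streaming engine when they need fully finished results with low latency.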


For parallel programming, what's the OS-level difference compared to languages like Python or modern C++?

Domain.spawn (fun _ -> print_endline "I ran in parallel")

Anyway, love the simplicity of this expression!


At the OS level, OCaml 5+ domains and C++ threads both map to OS threads that can run in parallel across cores. Python threads are OS threads too, but the GIL prevents them from executing Python bytecode in parallel, so CPU-bound work needs multiprocessing (or the newer free-threaded builds).


AI can hallucinate, but real-time detection is key.


Re: EPS and CPU utilization, does WS still perform better than SSE?


The tests didn't show much of a performance difference between SSE and WS across these scenarios; both technologies perform similarly in most use cases.

WS does show lower client-side CPU utilization than SSE, meaning WebSocket spends less CPU per event and leaves more headroom for the rest of the application.


For OCaml users interested in streaming data processing (similar to Flink or Spark) but looking for a faster, more efficient option, check out this OCaml driver for Timeplus Proton. Concise, safe, highly performant, and fun!

-> Streaming Queries - Process large datasets with constant memory usage

-> Async Inserts - High-throughput data ingestion with automatic batching

-> Compression - LZ4 and ZSTD support for reduced network overhead

-> TLS Security - Secure connections with certificate validation

-> Connection Pooling - Efficient resource management for high-concurrency applications

-> Rich Data Types - Full support for ClickHouse types including Arrays, Maps, Enums, DateTime64

-> Idiomatic OCaml - Functional API leveraging OCaml's strengths


[1] https://github.com/mfreeman451/proton-ocaml-driver


Pretty cool to see a dependency-free C++ read/write Iceberg client, and even better that it's open source. Pipelines are all about processing and routing, ideally to an open, flexible destination with long-term retention and no lock-in. Writing into Apache Iceberg is becoming critical for giving users real control, rather than writing into specific data warehouses or lakehouses that are hard to move out of.

