Proton, a fast and lightweight alternative to Apache Flink (github.com/timeplus-io)
75 points by jinqueeny on Jan 30, 2024 | 45 comments


>Lightweight: Proton is a single binary (<500MB)

Okay, uhm, does this mean the 500 MB is the memory consumption footprint?

No way someone is shipping a giant 500 MB binary hunk, and the selling point is that it's lightweight?

But 500MB of RAM for a DB also seems tiny. That can't be right..


500MB is the binary size when you download Proton from the GitHub release page. It is also available as a Docker image. The memory consumption can be 200MB or a few GB, depending on the workload. Compared to setting up Flink/ksqlDB with the JVM, Proton is much more lightweight to get started with.


From the releases page it looks like the binaries are indeed around 500MB (some platforms actually exceeding that); granted, I am not that familiar with DBs, but calling a 500 MB binary “lightweight” is a bit of a stretch…


1. Binary size and memory consumption are two different concepts. 2. If you compare the Docker image sizes of Flink and Proton, you will get a better understanding here: the Flink Docker image is 500M after compression and Proton's is 170M. Compared to Flink, it is much more lightweight.


I’m afraid I don’t understand your point: I am not trying to compare Flink to Proton’s docker image sizes; nor do I see how it is relevant. In this case, they are clearly talking about the size of the binary as in “executable of the software you can download”. Both the OP and myself are surprised that a 500MB binary is considered “lightweight”, probably because -even in 2024- “lightweight” should probably still be around the tens of MBs at most.


500MB for such a complete product is tiny! The largest CDN in the world ships 10GB+ binaries. 1G is common for large code bases if you link things statically. The bloat tends to come from transitive dependencies, most direct code is small in size


I was thinking proton as far as particles go isn't lightweight. Neither is Electron web app packaging.


Love the small footprint, more databases should have this. I've recently gotten into building some applications with SQLite and it's been a blast.

I went through the FAQ and read through the README on the GitHub page, however I'm unable to figure out some really ideal use cases for Proton. Ideally from a product perspective.

I like to have some examples so when I am building in the future I can quickly know off the top of my head, "oh yeah, let's just use proton for this."

Does anyone have a few different applications types which they've built or problems they've solved with proton? Bonus points for external examples with some code.


Thanks for the feedback (I am part of the Timeplus team)

https://docs.timeplus.com/showcases lists a few real world use cases we solve with our customers. Proton is the core engine of Timeplus Cloud, which adds extra UI, sources/sinks, multi-tenancy etc.

https://docs.timeplus.com/proton-howto lists some of the common tasks for processing real-time data. Basically if you're in the data engineering space, Proton can do many things that Flink/Spark/ksqlDB can do, just faster and more lightweight.

Good point about "oh yeah, let's just use proton for this." We will add more docs and examples for such a newly open-sourced project (we have run the cloud service for 1.5 years).


Great points! SQLite as an analogy is fantastic for its small footprint. Swift download, deployment, and testing significantly boost dev productivity. Moreover, the dependency-free single binary can efficiently slash deployment and operational costs. In addition to its outstanding performance, Proton users are specifically requesting data stream processing, routing, analytics, and actions at the edge or in a hybrid environment before sending data to their centralized data warehouses. These stand out as unique capabilities compared to existing complicated stacks.


(PS I am one of the contributors of Timeplus Proton. What I said could be biased.) Thanks for the feedback, it totally makes sense. We will definitely work on the README, FAQ etc. to make the key use cases more explicit.

A very typical scenario is to live-query / analyze data in Kafka topics with Proton streaming SQL and materialize the processed / aggregated results to a downstream system like ClickHouse or a target Kafka topic. What you need to do is: 1. download the binary with curl 2. create an external stream pointing to your Kafka topic 3. run the streaming SQL.
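
For anyone who wants to see roughly what steps 2 and 3 look like, here is a minimal Python sketch that only builds the two SQL statements as strings. The stream name, broker address, and topic are made up for illustration, and the SQL shape (CREATE EXTERNAL STREAM with type/brokers/topic settings, then a plain SELECT that keeps running) is written from memory of the Proton README, so treat it as an assumption and check the docs before relying on it.

```python
# Hypothetical helpers: they only format SQL text, they do not talk to Proton.
# The SQL syntax below is an assumption based on the Proton README.


def external_stream_ddl(stream: str, brokers: str, topic: str) -> str:
    """Step 2: DDL for an external stream that reads a Kafka topic in place."""
    return (
        f"CREATE EXTERNAL STREAM {stream}(raw string) "
        f"SETTINGS type='kafka', brokers='{brokers}', topic='{topic}'"
    )


def streaming_count_sql(stream: str) -> str:
    """Step 3: a long-running query that keeps counting events as they arrive."""
    return f"SELECT count() FROM {stream}"


if __name__ == "__main__":
    # Placeholder names; replace with your own broker and topic.
    print(external_stream_ddl("frontend_events", "localhost:9092", "frontend-events"))
    print(streaming_count_sql("frontend_events"))
```

You would paste the printed statements into the Proton client; nothing here connects to Kafka or Proton by itself.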


Redpanda, Proton, ClickHouse: a perfect match as a single-binary approach for lightweight and high-performance data stream processing and analytics in one compact box!


I think it’s supposed to be an alternative to Flink SQL/Table API.

The important part of Apache Flink is the stateful streaming and fault tolerance characteristics of Flink. This is not an alternative to Flink as a distributed runtime for event driven applications.


Proton does support stateful stream processing and has fault tolerance.

Proton saves the state in memory and on the local file system.

The cluster and HA features are in Timeplus Platform, which is built on top of Proton.


Yes, it's more of an alternative to Flink SQL. SQL is the main interface, and you can also build User-Defined Functions in JavaScript, running in a V8 sandbox.

While the Flink DataStream and Table APIs are powerful and flexible, you can easily set up Proton to learn or use streaming SQL, or to query historical data. Thousands of ClickHouse functions are available to use. If you like ClickHouse or Redpanda, you will probably like Proton too.


I work for Timeplus, and just to mention for those asking how to use this? One real quick use case was mentioned on HN a few days back. Use Proton to view Hacker News API in any way you want. Why not try something like this with Flink? You could, but Flink's learning curve is sort of high and needs more setup. Proton you can get with a brew install command. A few minutes later ... and you've got HN sliced and diced any way you want. Now apply that to any live data stream - including crypto, stocks, etc. https://bytewax.io/blog/hacking-hacker-news


Is Flink used that much? Last time I've seen it in the "news" it was related to Adobe's CMS, which, well, doesn't exactly have a stellar reputation, so it suffered a bit by association.

And boy, Kafka more and more seems like an early warning system for architecture astronautics. (Not necessarily bad by itself, but often part of hectoliters of manure poured in size 6 wellies)


Flink is used quite a lot in production globally, there are many gems in it for stream processing, and it is probably still the state of the art in stream processing. Timeplus Proton stands on the shoulders of giants (Flink is definitely one of them), and tries to provide an easier / more resource-efficient solution via (ClickHouse) database technologies. PS, I am one of the contributors of Timeplus Proton, so I could be biased.


Flink is used extremely heavily in some rather large organizations.

Also it's very popular in China.


Are we out of names in this industry?


We ran out of good names years ago. Then Java came along and we used up all the words beginning with J. Then the dotcom years scooped up all the domain names and all the twsted up wrds like flickr. Then npm took all the variationsofpunctuation, varations-of-punctuation and variations_of_punctuation. Now there are no virgin words left in the English speaking world.

TBH, I'm amazed that Proton wasn't already taken.


I think it's even funnier that ProtonDB already exists and has nothing to do with this: https://www.protondb.com/.



it's a spicy question.. The full comparison deserves one or a series of blogs

In short, data streaming is getting popular. Apache Flink, Apache Spark, and ksqlDB are the traditional players. They work well in some cases, but have challenges with ease of use, ease of deployment, and sometimes even performance.

Proton, RisingWave, Materialize, etc. are the alternatives. In the end, it's always 'case by case' to pick the best tools that work for your team in your projects.

If you think this is too general, well:

- Materialize no longer provide the latest code as an open-source software that you can download and try. It turned from a single binary design to cloud-only micro-service

- RisingWave has been open-sourced for 1.5 years and released their cloud in the middle of last year. It builds its own row-based historical storage, while Proton leverages ClickHouse for faster OLAP-like workloads. There are also more SQL functions supported in Proton, because it's powered by ClickHouse. Proton uses local SSD as the main storage and has an option to send older data to object storage, while RisingWave primarily uses object storage and uses local disk as a cache. So in general Proton can achieve lower latency for both streaming and historical queries.

I am a big fan of coffee and drink different kinds of coffee over time, even on the same day. It's all about the use cases and preferences. The same applies to choosing open-source tools to query live data. At the last Current conference, Gang and I gave a talk about different tools and different coffee. You may check https://www.timeplus.com/post/query-kafka-with-sql-current23


> Materialize no longer provide the latest code as an open-source software that you can download and try. It turned from a single binary design to cloud-only micro-service

Materialize CTO here. Just wanted to clarify that Materialize has always been source available, not OSS. Since our initial release in 2020, we've been licensed under the Business Source License (BSL), like MariaDB and CockroachDB. Under the BSL, each release does eventually transition to Apache 2.0, four years after its initial release.

Our core codebase is absolutely still publicly available on GitHub [0], and our developer guide for building and running Materialize on your own machine is still public [1].

It is true that we substantially rearchitected Materialize in 2022 to be more "cloud-native". Our new cloud offering offers horizontal scalability and fault tolerance—our two most requested features in the single-binary days. I wouldn't call the new architecture a microservices design though! There are only 2-3 services, each quite substantial, in the new architecture (loosely: a compute service, an orchestration service, and, soon, a load balancing service).

We do push folks to sign up for a free trial of our hosted cloud offering [2] these days, rather than trying to start off by running things locally, as we generally want folks' first impressions of Materialize to be of the version that we support for production use cases. An all-in-one single machine Docker image does still exist, if you know where to look, but it's very much use-at-your-own-risk, and we don't recommend using it for anything serious, but it's there to support e.g. academic work that wants to evaluate Materialize's capabilities to incrementally maintain recursive SQL queries.

If folks have questions about Materialize, we've got a lively community Slack [3] where you can connect directly with our product and engineering teams.

[0]: https://github.com/MaterializeInc/materialize/tree/main

[1]: https://github.com/MaterializeInc/materialize/blob/main/doc/...

[2]: https://materialize.com/playground/

[3]: https://materialize.com/s/chat


Both Materialize and RisingWave are written in Rust and provide a Postgres-compatible SQL interface. The user experience is closer to a traditional database.

Proton is written in C++, adding stream processing to ClickHouse and making the stream the main concept. It is designed to support streaming analytics, where aggregating large amounts of data over time is the focus.

In general, these three are very similar.
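
To make "aggregating data over time" a bit more concrete, here is a toy Python sketch of a tumbling-window count, the kind of aggregation streaming SQL engines typically expose. This is not how Proton, Materialize, or RisingWave implement it; the function name and the event shape (timestamp in seconds, key) are made up for illustration, and it processes a finite list rather than a live stream.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def tumbling_counts(
    events: Iterable[Tuple[int, str]], window_seconds: int
) -> Dict[Tuple[int, str], int]:
    """Count events per (window_start, key) using fixed, non-overlapping windows."""
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    for timestamp, key in events:
        # Every timestamp falls into exactly one window: round down to its start.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


if __name__ == "__main__":
    # Hypothetical events: (timestamp in seconds, device name).
    sample = [(0, "a"), (1, "a"), (5, "a"), (6, "b")]
    for (start, key), n in sorted(tumbling_counts(sample, 5).items()):
        print(start, key, n)
```

A real engine does the same bucketing incrementally and emits each window's result once the window closes, instead of waiting for the input to end.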


Just tried Proton. Simple enough for me to set up. I got my data in quickly and could query it with no waiting. I am reading your docs and trying to write more complex stuff. PS For the JOIN doc, personally, I would like to see more visualizations to demonstrate the joining behavior (between changelog and versioned).


Glad to hear that. In the Timeplus Cloud UI, there is a way to visualize the query plan/DAG. Sorry it's not available in command line yet.


I can't be the only one who cringes when they see otherwise smart people suggesting a non-safe language as a Good Thing.


Curious as to how the code base evolved after forking from clickhouse


Proton is a lightweight streaming processing "add-on" for ClickHouse, and we are making these delta parts as standalone as possible. Meanwhile contributing back to the ClickHouse community can also help a lot.

Please check this PR from the proton team: https://github.com/ClickHouse/ClickHouse/pull/54870


Oh that’s cool. How do plugins/add-ons work with ClickHouse?


They are table engines, and they need to be compiled into the single binary instead of a .so.


Project-leader:

Team, what do you think about building that with proton?

A:

Sorry, what? We can't do that with email!!

B:

I don't understand, if you want that to run on windows we should use windows, proton is mainly for games.

C:

Why are we talking about reverse osmosis now? Is that software not for a postal-service?

D:

Yeah ok, let us use AWS Proton for deployment, but that's far in the future.


Not to be confused with Apache Proton [0], a message bus tool.

[0] https://qpid.apache.org/proton/


Or Valve's Proton[0], a tool for playing Windows games on Linux.

[0]https://github.com/ValveSoftware/Proton


Yes, that's a cool project. I won't miss any Proton update on my Steam Deck, and sometimes we got signups from proton mail too


Or Proton [0], the company behind Proton Mail and other end-to-end encrypted productivity apps.

[0] https://proton.me/


Or proton [0], the subatomic particle with a positive electric charge

[0] https://en.wikipedia.org/wiki/Proton


Not to be confused with electron, an ephemeral computer program with no mass, yet described as slow and heavy by most people who know of it.


Reusing short English words is a big problem in computer science.


Thanks for letting us know. Technically it's not Apache Proton. The link is the Proton component in Apache Qpid. Maybe we should set up a list of awesome Proton(s).


When recycling names that already exist and have traction, can we append some context to the end, e.g. proton-stream? I was expecting Steam's Proton.


Proton started as one of Timeplus's internal projects, which are all named after particles.

Yes, it lacks context, but can you get the context from Flink or Spark? Both of these are just data processing tools, supporting streaming as well.


How about "Timeplus Proton"? Yes, I know there are a good game engine and a mail service with that name. At Timeplus, components are named after particles, e.g. https://github.com/timeplus-io/chameleon, our data generator



