Hacker News | agallego's comments

our real storage is s3 - local disk is for the staging/raft layer. how is that not cloud native? if by cloud native you mean k8s, it is true that our k8s operator was built mostly for our cloud, but we released it... the good news is a new interface (same code) with more friendly user defaults is about to be released. you can track it all on github tho.


Oh I see the operator now, looks decent. Previously, the deployment documentation I had found was very manual and full of pod exec commands.

Having worked with many operators in the wild, anything that gives you more control through CRDs/automation and less manual pod intervention is a huge win; it lets us bake deployment into our already existing pipelines for deployments and releases too. The Confluent ($$$$)/Strimzi operators do well on that front. I'm super excited to have competition in this space!

I'll keep an eye out for the new release!


totally. we built a new team focused on the dev experience of k8s alone. 90 seconds to prod (on a working eks cluster) with TLS, external certs, etc. that's the benchmark we're trying to hit :)


When you add compaction, indexing, recovery, tiered storage, etc. some things become harder to reason about wrt systems resources if you are embedded.


For long-term storage I agree too. The reason we invented our BYOC was so that (1) you own your storage and (2) we only charge you for the value add.


We’re about to release a revamped wasm engine and a new sdk with previous lessons learned. Should be cool


Any sign of JSON schema in the registry? That would be great if so!


alex here, original author of redpanda

it's hard to respond to a 6-part blog series - released all at once - in an HN thread.

- what we can deterministically show is data loss on apache kafka with no fsync() [shouldn't be a surprise to anyone] - stay tuned for an update here.

- the kafka partition model of one segment per partition could be optimized in both architectures

- the benefit for all of us is that all of these things will be committed to the OMB (Open Messaging Benchmark) and will be on GitHub for anyone interested in running it themselves.

- we welcome all confluent customers (since the post is from the field CTO office) to benchmark against us and choose the best platform. this is how engineering is done. In fact, we will run it for you at no cost. Your hardware, your workload, head-to-head. We'll help you set it up with both... but let's keep the rest of the thread technical.

- log.flush.interval.messages=1 - this is a stance we took long ago, back in 2019. As someone who has personally talked to hundreds of enterprises to date: most workloads in the world should err on the side of safety and flush to disk (fsync()). Hardware is very good today and you no longer have to choose between safety and reasonable performance. This isn't the high latency you used to see on spinning disks.
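For context, these are the standard Kafka broker flush settings being discussed (values here are illustrative, not a recommendation for any particular workload):

```properties
# Flush (fsync) the log after every single message - the maximum-durability
# stance described above
log.flush.interval.messages=1

# Alternative: flush on a time interval instead (milliseconds). By default
# this is unset and flushing is left to the OS page cache.
log.flush.interval.ms=1000
```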


It's a common misconception about Kafka and fsyncs. But the Kafka replication protocol has a recovery mechanism, much in the same way that Viewstamped Replication Revisited does (except it's safer due to the page cache), which allows Kafka to write to disk asynchronously. The trade-off is that we need fault domains (AZs in the cloud), but if we care about durability and availability, we should be deploying across AZs anyway. We've seen plenty of full region outages, but zero power loss events in multiple AZs in six years.

Kafka and fsyncs: https://jack-vanlightly.com/blog/2023/4/24/why-apache-kafka-...


As far as I read the blog post, I understand that it assumes the scenario where "a replica dies (and loses its log prefix due to no fsync) and comes back instantaneously (before another replica catches up to the leader)".

Then, in Kafka, what if the leader dies from a power failure and comes back instantaneously?

i.e.: Let's say there are 3 replicas A(L), B(F), C(F) (L = leader, F = follower)

- 1) append a message to A

- 2) B and C replicate the message. The message is committed

- 3) A dies and comes back instantaneously before zk.session.timeout elapses (i.e. no leadership failover happens), having lost its log prefix due to no fsync

Then B and C truncate their logs, and the committed message could be lost? Or is there an additional safety mechanism for this scenario?
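The failure sequence above can be sketched as a toy simulation (this is not Kafka's actual code; the function names and the lists-as-logs model are invented purely to illustrate the truncation hazard being asked about):

```python
def crash_without_fsync(log, flushed_upto):
    """Power loss drops every entry not yet fsynced to disk."""
    return log[:flushed_upto]

def follower_truncate(follower_log, leader_log):
    """Followers truncate any entries past the returned leader's log end."""
    return follower_log[:len(leader_log)]

# 1) + 2) message "m1" is appended on leader A and replicated to
#         followers B and C -> the message is committed
A = ["m1"]
B = ["m1"]
C = ["m1"]

# 3) A power-cycles before the session timeout: no leadership failover,
#    but its unflushed log is gone (flushed_upto=0, since fsync never ran)
A = crash_without_fsync(A, flushed_upto=0)

# A resumes as leader; B and C truncate to match its (now empty) log
B = follower_truncate(B, A)
C = follower_truncate(C, A)

# the committed message is now gone from every replica
assert "m1" not in A + B + C
```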


I love this question. Would be great to hear back from Confluent about this.

One safety mechanism I can think of is that the replicas will detect the leader is down and trigger a leader election themselves. Or that upon restart the leader realizes it restarted and triggers a leader election in a way that B ends up as the leader. (not sure either is being done)

As I think about it more, even if there’s a solution I think I’ll stick to running Redpanda or running Kafka with fsync.


The solution seems to be fsync. That's what it's for. It's very appealing to wave it away because it's expensive.

The situation above may be just one example of data loss, but it seems there could be others whenever we gamble on servers restarting quickly enough, not crashing at the same time, etc.


Two comments here.

1) What about Kafka + KRaft, doesn't that suffer the same problem you point out in Redpanda? If so, recommending to your customers to run KRaft without fsync would be like recommending running with a Zookeeper that sometimes doesn't work. Or do I fundamentally misunderstand KRaft?

2) You mention simultaneous power failures deep into the fsync blog post. I think this should be more visible in your benchmark blog post, where you write about turning off fsyncs.


1) KRaft is only for metadata replication, and data replication is still ISR-based even with KRaft, so it doesn't change the conclusion


Nah, I dug deeper into this. Right conclusion, wrong reasoning.

The reason KRaft turns out to be fine is because the KRaft topic does fsync! Source: https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A...

KRaft is used for metadata replication in the same way that Zookeeper is used for metadata. I.e., in a very meaningful way.


repeating things does not make them true. I read the post. You can only control some failures, but happy for us to write our thoughts in blog form.


> what we can deterministically show is data loss on apache kafka with no fsync() [shouldn't be a surprise to anyone] - stay tuned for an update here.

Confluent themselves can show this; the part I'm curious about is whether you can show data loss outside of the known documented failure modes. Because anyone, myself included, can show data loss by running a cluster without fsync and simultaneously pulling the plug on every server.


> Because anyone, myself included, can show data loss by running a cluster without fsync and simultaneously pulling the plug on every server.

Woah, yeah that's a serious problem. Data loss under that scenario is nothing to sneeze at.


Then enable fsync. I don't really see a way around requiring synchronization to persistent disk if you want persistence across power outages, right?


Yes, that is what the linked benchmarks discuss...


It's not a serious problem for most deployments though.

You should be running Kafka in multiple DCs/AZs for high availability and scalability.

And in that scenario fsync is nice but not necessary.


I suppose, but that's the trade-off for performance. You have to design your system so that can't happen. If you're in the cloud, that means deploying multi-AZ; if you're colocating, it means paying for racks with separate power and/or battery backup so you have time to fsync and shut down; and if you're fully on-prem, you don't need my advice.

Or I suppose just pay for a managed service from someone who does that for you.


I am not entirely sure what the reason is to make Kafka transactional. The original goal was to have a message queue that holds statistical data, where data loss cannot significantly alter the outcome of the analytics performed on the (often incomplete) data. Why are we in this argument about fsync and such now? Did something change?

If you need reliable data storage do not use Kafka or similar technologies.


Do you have a source/link for that original goal? I wasn't aware of this, and as such I expect that I can rely on Kafka for my events. Also, if this is really the case it should be mentioned on Kafka's homepage.

Just checked kafka’s homepage, it mentions mission critical, durable, fault tolerant, stores data safely, zero message loss, trusted… Seems they’ve moved on from their original goal.


"The original use case for Kafka was to be able to rebuild a user activity tracking pipeline as a set of real-time publish-subscribe feeds. This means site activity (page views, searches, or other actions users may take) is published to central topics with one topic per activity type."

https://kafka.apache.org/


Kafka is used widely as a persistent event store, and its development features reflect that.

Why would I not just turn on fsync or deploy in a distributed pattern for reliability so I can just continue using it instead of ripping it out, benchmarking something new, teaching the entire org something new, potentially negotiating a new contract, and then executing a huge migration?


Just like heroin is widely used as a recreational drug. We live in a free world and you can use Kafka as a persistent, reliable store, even use it transactionally.

Instead of reading the marketing claims I like to read what @aphyr has to say about data storage systems.

https://aphyr.com/posts/293-call-me-maybe-kafka



Are you sure performance would be acceptable if you just turned on fsync on every message?


well it obviously depends on your usage patterns.

But at a certain point any technology is going to reach the limits of what current hardware and operating system primitives can do.

fsync vs. distributed consensus vs. other tradeoffs w.r.t reliability and consistency are not inherent to "Kafka or similar technologies". It's inherent to anything that runs on a computer in the real world.

Generally unless your scale is mind-bogglingly big, the ROI on tuning what you already have is going to be way way bigger than just ripping it out because you read a benchmarking article.
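To make the "depends on your usage patterns" point concrete, here is a rough sketch (my own illustration, not from any Kafka code) of per-message vs batched fsync in a write loop; the actual cost difference depends entirely on your disk:

```python
import os
import tempfile

def append_messages(path, messages, fsync_every):
    """Append messages to a log file, calling fsync() every `fsync_every`
    writes. fsync_every=1 is the safest (durable per message); larger
    batches amortize the flush cost at the risk of losing the tail."""
    written = 0
    with open(path, "ab") as f:
        for i, msg in enumerate(messages, 1):
            f.write(msg)
            written += 1
            if i % fsync_every == 0:
                f.flush()
                os.fsync(f.fileno())  # data is durable only after this returns
        # flush any trailing partial batch before closing
        f.flush()
        os.fsync(f.fileno())
    return written

path = os.path.join(tempfile.mkdtemp(), "log")
n = append_messages(path, [b"m" for _ in range(100)], fsync_every=10)
assert n == 100
assert os.path.getsize(path) == 100
```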


>The original goal was to have a message queue that holds statistical data

I suppose that might have been the original goal, but the current tag line includes "data integration" and "mission-critical".

"Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications."


I guess you can add any feature to anything. I think this whole investor driven development is just sad.


Following up - https://redpanda.com/blog/why-fsync-is-needed-for-data-safet...

Try this on your laptop to see global data loss - hint: multi-az is not enough

Regardless of the replication mechanism you must fsync() your data to prevent global data loss in non-Byzantine protocols.


Can you turn fsync off and rely on recovery with Redpanda?


no, because it is built into the raft protocol itself. with acks=-1 we only acknowledge to the producer once the data has

1. been written to a majority, and 2. the majority has done an fsync()

i can see in the future giving people opt-out options here tho.
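The two-condition acknowledgement rule described above can be sketched as follows (a minimal illustration with invented names and a simplified replica-state model, not Redpanda's actual code):

```python
def can_ack(replica_states, offset):
    """Ack the producer only once a majority of replicas have both
    written AND fsynced the log up to `offset`."""
    majority = len(replica_states) // 2 + 1
    durable = sum(
        1 for r in replica_states
        if r["written"] >= offset and r["fsynced"] >= offset
    )
    return durable >= majority

replicas = [
    {"written": 10, "fsynced": 10},  # leader, flushed
    {"written": 10, "fsynced": 10},  # follower, flushed
    {"written": 10, "fsynced": 5},   # follower, written but not yet fsynced
]

assert can_ack(replicas, 10)      # 2 of 3 are durable -> majority, safe to ack
assert not can_ack(replicas, 11)  # nothing written past offset 10 yet
```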


I tried writing it in Rust in 2017 and found some compiler bugs. To be honest, I also found compiler bugs in C++, but I felt more comfortable in C++, so I decided to write the first version in C++. The huge advantage is that storage engines in particular need to be more conservative in many dimensions, and having seen success with Scylla, Seastar was appealing to me as 'tried and tested' for storage systems.

Prior systems I had built with Facebook's Folly (a C++ lib), and I had also written my own eventing systems in the past, but the real value is Seastar being battle tested since 2016. Largely it has been the right decision for us: Redpanda, for its young age, has benefited from the stability of Seastar.


this is a bad take. people work on problems they find interesting; it follows that you write about them because you can. the author has worked on kernel hypervisors and databases, and is now doing stuff around code generation. btw, their code is apache 2 and on github; what they sell is the hosted version.


we use it for redpanda https://github.com/redpanda-data/redpanda - other langs are go for k8s integration, python (dev/prod tooling), and js for ui. that's mostly it.


I think this is a good architecture to focus on the developer experience. A rust runtime with a Python front end. Very cool to see


I’m not sure this is true. I’ve probably spoken with 500+ teams myself and by and large folks use default settings.


