
Is it throughput and latency that are the etcd bottlenecks? Our database, RonDB, is an in-memory open-source database (a fork of MySQL Cluster). We have scaled it to 100m reads/sec on AWS hardware (not even top of the line). Might be an interesting project to implement an open-source etcd shim on top of it?

Reference: https://www.rondb.com/post/100m-key-lookups-sec-with-rest-ap...



See https://github.com/k3s-io/kine; k3s uses it to shim the etcd API onto MySQL, Postgres, and SQLite.


The setting is configurable, but by default etcd's Raft implementation requires a voting node to write to disk before it casts a vote, as in actually flushing to disk, not just writing to the file cache. Since a client can't get a response until a majority has voted, this is why it's strongly recommended to use the fastest possible disks and keep the nodes geographically close to each other, and why etcd's default storage quota is only 2GB per node.

All in all, it was a poor choice for Kubernetes to use this as its backend in the first place. Apparently Google uses its own shim, but there is also kine, which was created long ago for k3s and lets you use an RDBMS instead. k3s originally used SQLite as its default, but any API-equivalent database would work.

We should keep in mind that etcd was meant to be, literally, the distributed /etc directory for CoreOS: something you read from often but write to rarely. It's a configuration store. Kubernetes deciding to also use it for /var was never a great idea.
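For context, the kine approach mentioned above can be tried directly with k3s via its datastore flag. The endpoint values below are hypothetical; check the k3s docs for the exact DSN formats your version expects:

```shell
# k3s speaks the etcd API to the apiserver and translates it through kine.
# Hypothetical credentials/hosts; substitute your own.
k3s server \
  --datastore-endpoint="mysql://user:pass@tcp(db.example.com:3306)/k3s"
```

kine can also be run as a standalone process in front of an existing database, with the apiserver pointed at it as if it were etcd.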


RonDB uses a non-blocking 2PC (two-phase commit) algorithm: it commits in memory, then does a group commit of transactions to disk every 500ms. This means it can handle insane write throughput as well as read throughput. However, if both of your DB nodes fail, you could lose up to 500ms of data, which is not the end of the world for k8s. Normally you would locate the DB nodes in different AZs, reducing the probability of correlated failures.


At that point it is apples to oranges. One of the main reasons etcd writes are slow is that they are guaranteed to be durably persisted across a quorum.

If you just turned off file-system syncs in etcd, you could probably get an order of magnitude better performance as well.
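For what it's worth, etcd ships a flag for exactly that experiment; I believe it is `--unsafe-no-fsync`, but verify against your etcd version's flag reference before relying on it:

```shell
# Benchmarking only: skips fsync on the WAL, so a crash can silently lose
# writes that were already acknowledged to clients.
etcd --unsafe-no-fsync
```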



