It’s amazing we’re still finding serious perf issues in TCP stacks, some of the most battle-tested subsystems we have. It’s really a testament to how complicated flow control, congestion control, etc. are. There are so many little gotchas that can wreak havoc, especially in the face of shared resources.
Makes me worried about QUIC and how long it will take to build up the same level of rigor as TCP stacks, plus all the other fancy features. And who’s going to maintain a std lib for every popular language? Heck, many of them don’t even provide HTTP or recent TLS versions.
They’ve just started building a QUIC stack in Go (std) and the excellent community project quic-go is something like 8y in with 70kLOC (although mostly tests). And there are still important perf improvements missing.
Having TCP provided by the kernel is something I’ve come to appreciate more and more. Plus the kernel can divide up resources and schedule based on information that simply doesn’t exist in user-space.
Great insights, and I would add that it’s not just the internet consumer side where there is pressure on TCP evolution, it’s in the datacenter, too. [1]
I have recently tracked down a number of networking issues caused by cgroups introducing network stalls in certain scenarios. There were a number of patches submitted upstream in v6.x to fix these issues. It is a constant pain seeing such fundamental networking issues still occurring in the latest Linux kernel versions. On one hand I would like to track bleeding-edge kernel releases to get the latest fixes and performance improvements, but on the other hand I do not have enough resources to maintain our own kernel builds and patches.
Do you have any references to specific bugs here? We depend pretty heavily on containers and I'd love to look into these and see if we are impacted and whether we should carry these patches.
The one thing I was hoping to see but didn’t: how long has this been broken? Has it been broken since coalescing was added? Or was it a regression introduced by some other patch 2 years ago, or what?
They said they started noticing this recently. Is that due to its introduction? Or has something else changed (configuration, tuning, the traffic passing through) that is causing this far more than it used to?