I understand that, but if they're going to go down that route for OpenBSD, why not bring it to all platforms? While I can believe that going the raw syscall route can be easier than dealing with libc idiosyncrasies, it seems to me that maintaining both interfaces depending on the target OS would be trickier than just cutting your losses and going the libc route everywhere.
After all, as far as I can tell, it's not just an OpenBSD problem, since famously they got breakage on macOS as well.
I must admit that I haven't taken the time to analyze the pros and cons in depth here, but Go's history of NIH, coupled with the fact that basically every other mainstream language manages to work by binding to the libc, leaves me very perplexed.
In particular, some of the points raised in the blogpost I linked seem fishy to me. For instance, errno being a global: this is in no way a Go-specific problem (multithreaded C couldn't run concurrent syscalls if that were true). In practice errno is thread-local rather than a true global. It's explicitly documented in the man page:
> errno is defined by the ISO C standard to be a modifiable lvalue of type int, and must not be explicitly declared; errno may be a macro. errno is thread-local; setting it in one thread does not affect its value in any other thread.
I can believe that there are other issues I fail to consider, but again it works for everybody else, what makes Go so special here?
> I understand that but if they're going to go down that route for OpenBSD why not bring it to all platforms?
Because they really don’t want to, so they will avoid doing it until forced to, as previously happened with e.g. macOS.
> I must admit that I haven't taken the time to analyze in depth the pros and cons here, but Go's history of NIH coupled with the fact that basically every other mainstream language manages to work by binding the libc leaves me very perplex.
One of the issues Go has is that it uses its own non-C stack. Libcs generally assume a C stack and don't understand growable, movable stacks, so they can get very cross when called with an unexpectedly small stack (IIRC Go defaults to 2k, while the smallest C stack I know of is 64k on old OpenBSD, then macOS's 512k for non-main threads).
Obviously I'm missing something here, because I was under the impression that, regardless of how big the stack is, the pages are going to be mapped (in the data structure pointed to by the CR3 register) when accessed and no earlier. Unless the allocation of the stack makes sure that it is mapped in its entirety upfront and stack access cannot incur minor page faults, which may not be so outlandish come to think of it. Can you please provide some clarification here? Thanks!
I'm not sure what clarification I could provide given I don't know what you're missing.
You're correct (AFAIK anyway) that stack pages are mapped lazily. That doesn't change the fact that C stacks are large allocations, while Go will only allocate a very small stack (2k last I checked) per goroutine.
Goroutines don't map directly to OS threads though, do they? So in practice it only matters for "hard" threads. I have no idea about what the Go scheduler looks like though, so I can't say if there's a direct relation between the two (coroutine stack vs. thread stack).
Also, the default pthread stack size is (usually) 2MB, which is indeed non-negligible if you have a ton of threads, but with pthread_attr_setstacksize you can lower it. I can't seem to find the actual minimal size at the moment, but I have a vague memory that you can reduce it to 16kB portably.
I'm currently working on a multithreaded Rust program with a bunch of threads and strong memory constraints and I use the runtime to reduce the stack to 64kB, it seems to work just fine.
And again, given that the language is garbage collected, I can't help but find it a bit amusing that they're being stingy with a few MB of vmem per thread. I guess that frees virtual memory for use in the ballasts!
> Goroutines don't map directly to OS threads though, do they? So in practice it only matters for "hard" threads.
It matters for goroutines because that means you can't straight call into foreign code from a goroutine's stack, which is why cgo and friends are so problematic.
> Also the default pthread stack size is (usually) 2MB which is indeed non negligible if you have a ton of threads
That's not really relevant, because it's only vmem; that's the point. The actual resident size of a 2MB stack is 4k unless you start dirtying more pages. Unless you're running on 32b, vmem doesn't really matter.
Yes, sorry, let me be clearer: I don't understand why Go should allocate a small 2k stack (you are correct, the Go stack size has fluctuated over the years; nowadays it's 2k) and, if it grows, go to the trouble of copying it to a larger one, instead of just allocating a big stack that will grow lazily anyway.
Obviously the folks that created Go aren't stupid, far from it, so there must be a real and valid reason. But I can't easily imagine what it is.
It's only vmem, though; that's the point. It doesn't really matter that you have 100000 stacks of 8MB when all of that is vmem and you've got a page committed on each.
Each allocation also needs an entry in the page table (actually a virtual memory tree). On x86/amd64 the fanout is about 1:512.
If you have 8MiB stacks, then a minimally allocated stack uses a 4KiB data page, but also 2MiB of address range uses up a full 4KiB bottom level page table, and 8MiB range takes up 4/512 x 4KB of 2nd level page table and so on. So you use about 8.03 KiB RAM if you never touch more than 4KB and your 8MB reservations are mostly grouped. Some architectures have bigger pages, increasing fanout but also the minimum allocation.
Contrast that with 2KiB stacks without reservations/overcommit: you use about 2KiB of usable RAM + (1/2) x (1/512) of a 4KiB 1st-level page table + ..., assuming allocations are again mostly grouped. Hence, for up to 2KiB of stack memory you need about¹ 2.005 KiB of RAM. It works the same for 16 KiB and even 64 KiB page sizes.
100000 * (8MiB reserved, 1..4KiB used stack) needs ~784MiB RAM.
100000 * (2KiB reserved, 1..2KiB used stack) needs ~200MiB RAM.
Note that, if you actually touch your reserved stack, even once, your allocation can balloon to possibly tens or hundreds of GiB (100k * 8MiB = ~800GiB), unless you do complicated cleanup, while a segmented stack can keep the allocations within reasonable efficiency, freeing any excessive stack allocations in userspace.
¹ ignoring bookkeeping overhead in both cases, to keep the calculation clear. Hopefully it isn't more than a dozen bytes or so.
Yeah, so nothing really relevant: that's 1/20th of a relatively basic dev laptop's memory for 100k threads.
And of course that's an insane worst-case scenario of 8MB stacks, which the Linux devs picked because they wanted to put a limit on the stack but didn't really care to have one. Windows uses 1MB stacks and macOS uses 512k off-main stacks, so you don't need anywhere near 8MB to get C-compatible stack sizes.
With 8M/4M/2M/1M/512k stack sizes and a 4K page size, you get about 800M/800M/800M/600M/500M RAM usage, of which only 400M is usable; the rest is overhead. At a 16K page size, it's ~1600M. Compare with 200M if using 2K side-by-side stacks, in any configuration.
Yes, probably not too excessive, even though it is noticeable when you spawn threads for anything and everything, and run more than a single application on a non-SV-developer PC.
I think the main problems start when you actually touch more than the base allocation of the stack (or just use 16K pages). Maybe segmented stacks grow from 200M to 300M, 500M or whatever you actually use at a given time (with, say, 20-60% efficiency), but your C-compatible stacks might go from 500M to 3G¹ if you on average touch just 32K of stack per thread with some unlucky function, even though at any given time only the same ~100M of memory actually stores useful information.
¹ or more, no idea how high it typically goes, but that in itself is a nasty gotcha, and a likely reason you won't find "lightweight" threading in combination with native per-thread stacks
> With 8M/4M/2M/1M/512k stack size, 4K page size, you get about 800M/800M/800M/600M/500M RAM usage
Simulating this (by creating 100k maps of the relevant sizes), there is no difference in RES between 8M mappings and 512K mappings: it was ~495M for both; only the VMEM varied (respectively ~780G and ~50G). Touching the second page increased the RES of both to ~816M, which is about what you'd expect.
This is on a more or less stock x64 Mint.
> I think main problems start when you actually touch more than the base allocation of the stack
The thing is you're unlikely to do that in all of your 100k routines, most of them will not grow beyond their first page, and maybe their second… at which point the routine's stack would have grown to 8k anyway.
> maybe segmented stacks grow from 200M to 300M, 500M or whatever you actually use at a given time (with say 20-60% efficiency)
Go has not used segmented stacks since 1.3.
> your C-compatible stacks might go from 500M to 3G¹ if you on average touch just 32K of stack per thread with some unlucky function
Your not-C-compatible stacks will do the exact same thing. Since 1.3, stacks are realloc'd and double in size on every overflow.
The Go runtime will also shrink stacks if able (halving them) during a GC run, but you can do essentially the same thing on your C stack using madvise(2), and without the need to copy stack data around.
So what gain there is, is really only for goroutines which never grow beyond their initial size (and only since the default stack was decreased from 8k to 2k in 1.4), at the cost of all the C incompatibility mess. And it assumes these 2k stacks are allocated from a reusable pool (which is probably a fair assumption, though I certainly did not check it), otherwise they'd be on different pages anyway.
> Simulating this .. there is no difference in RES between 8M pages and 512K pages
Linux memory management has notoriously complicated reporting. Your program only has 495M of usable memory mapped (RSS; not sure where the extra 100M is coming from), but RSS does not count page tables.
You cannot actually use the 400M of sparsely allocated memory (4K at 2+MB intervals) without another 400M of page tables. I'd suggest you try allocating and using >50% of RAM, or just compare how much you can use sparsely vs. densely before your program hits OOM. Note: you may need to enable overcommit and increase the maximum map count if you are mapping regions separately, and preferably disable swap to avoid thrashing.
sysctl vm.overcommit_memory=1
sysctl vm.max_map_count=10000000
swapoff -a
You can also watch the PageTables total in /proc/meminfo, which should show the difference.
That being said, I don't use Go; I was merely pointing out that virtual memory management is not magical, and it has real memory costs when doing sparse allocations (it also has quite significant other costs).
> The Go runtime will also shrink stacks if able (halving them) during a GC run, but you can do essentially the same thing on your C stack using madvise(2), and without the need to copy stack data around.
While you could do that, I believe it requires a syscall per-stack, which might be more expensive than a bit of copying, and I'm not completely sure how easy it would be to determine whether the memory has in fact been allocated and needs freeing.
> Such large allocations can fail, even if virtual, depending on over commit settings.
Given how Go has no issue being prescriptive as hell on other things, I don't see why they couldn't just go "set vm.overcommit_memory to 1 and fuck off"; that's exactly what e.g. Redis tells you.
You aren't thinking with big numbers. Small amounts of admin work add up. Virtual pages aren't free. Go has had to think very hard about what to pare down to avoid dragging to a halt with orders of magnitude fewer threads.
Not all platforms have a "libc". Linux is an example: there is no standard C library, and while glibc is very common, there are several distributions that use other C libraries like uClibc and musl. For example, a gaming handheld I have which runs Linux uses uClibc.
Another example is Windows: the platform API does not provide a C library (even MSVC has its own). While there is an MSVCRT.DLL, it is not recommended to link against it, as it is there only because some other software relies on it, and its semantics date from around Visual C++ 4 (IIRC).
It is already next to impossible to write software that requires “Linux” and nothing more with all the kernel functionality that can be enabled or disabled.
Linux is a component of many different platforms, which indeed provide different libcs, but also different TLS libraries, different CPU architectures, different Linux configurations, and whatever else.
As far as I know, with respect to Windows, it only provides stable interfaces via C libraries, and does not have a stable binary interface to the kernel directly.
For containers, Go static binaries are great. I used to build Docker containers for Kubernetes using an 8 MB Go binary. That was all you needed. No libraries. No "minimum OS" Alpine or Ubuntu image.
> I can believe that there are other issues I fail to consider, but again it works for everybody else, what makes Go so special here?
errno being thread-local doesn't really help all that much with a M:N threading model -- the runtime is going to have to be extremely careful to not stomp all over it. (Whenever calling into anything which uses errno. Of course system calls do that as well, but it's a much smaller surface area than "most of libc".)