
Each allocation also needs entries in the page table (really a radix tree over virtual addresses). On x86/amd64 each 4KiB table holds 512 entries, so the fanout is 512 per level.

If you have 8MiB stacks, then a minimally allocated stack uses one 4KiB data page, but each 2MiB of address range containing a mapped page also needs a full 4KiB bottom-level page table, and the 8MiB range takes up 4/512 x 4KiB of the next-level table, and so on. So you use about 8.03KiB of RAM if you never touch more than 4KiB, assuming your 8MiB reservations are mostly grouped. Some architectures have bigger pages, which increases the fanout but also the minimum allocation.

Contrast that with 2KiB stacks without reservations/overcommit: you use about 2KiB of usable RAM + (1/2) x (1/512) of a 4KiB bottom-level page table + .., again assuming allocations are mostly grouped. Hence, for up to 2KiB of stack memory you need about¹ 2.005KiB of RAM. This works the same for 16KiB and even 64KiB page sizes.

100000 * (8MiB reserved, 1..4KiB used stack) needs ~784MiB RAM.

100000 * (2KiB reserved, 1..2KiB used stack) needs ~200MiB RAM.

Note that if you actually touch your reserved stack, even once, your allocation can balloon to tens or hundreds of GiB (100k * 8MiB = ~800GiB) unless you do complicated cleanup. A segmented stack, by contrast, can keep the allocations reasonably efficient, freeing any excess stack memory in userspace.

¹ ignoring bookkeeping overhead in both cases, to keep the calculation clear. Hopefully it isn't more than a dozen bytes or so.



> 100000 * (8MiB reserved, 1..4KiB used stack) needs ~784MiB RAM.

Yeah, so nothing really relevant: that's 1/20th of a relatively basic dev laptop's memory for 100k threads.

And of course that's an insane worst-case scenario of 8MB stacks, which the Linux devs picked because they wanted some limit on the stack but didn't really care what it was. Windows uses 1MB stacks and macOS uses 512k for non-main threads, so you don't need anywhere near 8MB to get C-compatible stack sizes.


> It's only vmem though is the point

With 8M/4M/2M/1M/512k stack size and 4K page size, you get about 800M/800M/800M/600M/500M RAM usage, of which only 400M is usable; the rest is overhead. At 16K page size, it's ~1600M. Compare that with ~200M using 2K side-by-side stacks, in any configuration.

Yes, probably not too excessive, even though it is noticeable when you spawn threads for anything and everything, and run more than a single application on a non-SV-developer PC.

I think the main problems start when you actually touch more than the base allocation of the stack (or just use 16K pages). Maybe segmented stacks grow from 200M to 300M, 500M or whatever you actually use at a given time (with say 20-60% efficiency), but your C-compatible stacks might go from 500M to 3G¹ if you on average touch just 32K of stack per thread with some unlucky function, even though at any given time only the same ~100M of memory actually stores useful information.

¹ or more, no idea how high it typically goes, but that in itself is a nasty gotcha, and a likely reason you won't find "lightweight" threading in combination with native per-thread stacks


> With 8M/4M/2M/1M/512k stack size, 4K page size, you get about 800M/800M/800M/600M/500M RAM usage

Simulating this (by creating 100k maps of the relevant sizes), there is no difference in RES between 8M maps and 512K maps: it was ~495M for both, only the VMEM varied (respectively ~780G and ~50G). Touching the second page increased the RES of both to ~816M, which is about what you'd expect.

This is on a more or less stock x64 Mint.

> I think main problems start when you actually touch more than the base allocation of the stack

The thing is you're unlikely to do that in all of your 100k routines, most of them will not grow beyond their first page, and maybe their second… at which point the routine's stack would have grown to 8k anyway.

> maybe segmented stacks grow from 200M to 300M, 500M or whatever you actually use at a given time (with say 20-60% efficiency)

Go has not used segmented stacks since 1.3.

> your C-compatible stacks might go from 500M to 3G¹ if you on average touch just 32K of stack per thread with some unlucky function

Your not-C-compatible stacks will do the exact same thing. Since 1.3, stacks are realloc'd and double in size on every overflow.

The Go runtime will also shrink stacks if able (halving them) during a GC run, but you can do essentially the same thing on your C stack using madvise(2), and without the need to copy stack data around.

So what gain there is, is really only for goroutines which never grow beyond their initial size (and only since the default stack was decreased from 8k to 2k in 1.4), at the cost of all the C incompatibility mess. And it assumes these 2k stacks are allocated from a reusable pool (which is probably a fair assumption, though I certainly did not check it), otherwise they'd be on different pages anyway.


> Simulating this .. there is no difference in RES between 8M pages and 512K pages

Linux memory management has notoriously complicated reporting. Your program only has 495M of usable memory mapped (RSS; not sure where the extra 100M is coming from), but RSS does not count page tables.

You cannot actually use the 400M of sparsely allocated memory (4K at 2+MB intervals) without another ~400M of page tables. I'd suggest you try allocating and using >50% of RAM, or just compare how much you can use sparsely vs. densely before your program hits OOM. Note that you may need to enable overcommit and increase the maximum map count if you are mapping regions separately, and preferably disable swap to avoid thrashing:

  sysctl vm.overcommit_memory=1
  sysctl vm.max_map_count=10000000
  swapoff -a
You can also watch the PageTables total in /proc/meminfo, which should show the difference.

That being said, I don't use Go; I was merely pointing out that virtual memory management is not magical, and sparse allocations have real memory costs (and quite significant other costs besides).

> The Go runtime will also shrink stacks if able (halving them) during a GC run, but you can do essentially the same thing on your C stack using madvise(2), and without the need to copy stack data around.

While you could do that, I believe it requires a syscall per stack, which might be more expensive than a bit of copying, and I'm not completely sure how easy it would be to determine whether the memory has actually been allocated and needs freeing.



