
Each allocation also needs entries in the page table (really a radix tree over virtual addresses). On x86/amd64 each 4KiB table holds 512 entries, so the fanout is 512 per level.

If you have 8MiB stacks, then a minimally allocated stack uses one 4KiB data page, but each 2MiB of address range containing a mapped page also needs a full 4KiB bottom-level page table, and the 8MiB range takes up 4/512 x 4KiB of the next-level table, and so on. So you use about 8.03KiB of RAM if you never touch more than 4KiB, assuming your 8MiB reservations are mostly grouped. Some architectures have bigger pages, which increases the fanout but also the minimum allocation.

Contrast that with 2KiB stacks without reservations/overcommit: you use about 2KiB of usable RAM + (1/2) x (1/512) of a 4KiB bottom-level page table + .., again assuming allocations are mostly grouped. Hence, for up to 2KiB of stack memory you need about¹ 2.005KiB of RAM. This works the same for 16KiB and even 64KiB page sizes.

100000 * (8MiB reserved, 1..4KiB used stack) needs ~784MiB RAM.

100000 * (2KiB reserved, 1..2KiB used stack) needs ~200MiB RAM.

Note that if you actually touch your reserved stack, even once, your allocation can balloon to tens or hundreds of GiB (100k * 8MiB = ~800GiB) unless you do complicated cleanup. A segmented stack, by contrast, can keep the allocations reasonably efficient, freeing any excess stack memory in userspace.

¹ ignoring bookkeeping overhead in both cases, to keep the calculation clear. Hopefully it isn't more than a dozen bytes or so.



> 100000 * (8MiB reserved, 1..4KiB used stack) needs ~784MiB RAM.

Yeah, so nothing really relevant: that's 1/20th of a relatively basic dev laptop's memory for 100k threads.

And of course that's an insane worst-case scenario of 8MB stacks, which the Linux devs picked because they wanted some limit on the stack but didn't really care what it was. Windows uses 1MB stacks and macOS uses 512k for non-main threads, so you don't need anywhere near 8MB to get C-compatible stack sizes.


> It's only vmem though is the point

With 8M/4M/2M/1M/512k stack size and 4K page size, you get about 800M/800M/800M/600M/500M RAM usage, of which only 400M is usable; the rest is overhead. At 16K page size, it's ~1600M. Compare that with ~200M using 2K side-by-side stacks, in any configuration.

Yes, probably not too excessive, even though it is noticeable when you spawn threads for anything and everything, and run more than a single application on a non-SV-developer PC.

I think the main problems start when you actually touch more than the base allocation of the stack (or just use 16K pages). Maybe segmented stacks grow from 200M to 300M, 500M or whatever you actually use at a given time (with say 20-60% efficiency), but your C-compatible stacks might go from 500M to 3G¹ if you on average touch just 32K of stack per thread with some unlucky function, even though at any given time only the same ~100M of memory actually stores useful information.

¹ or more, no idea how high it typically goes, but that in itself is a nasty gotcha, and a likely reason you won't find "lightweight" threading in combination with native per-thread stacks


> With 8M/4M/2M/1M/512k stack size, 4K page size, you get about 800M/800M/800M/600M/500M RAM usage

Simulating this (by creating 100k maps of the relevant sizes), there is no difference in RES between 8M maps and 512K maps: it was ~495M for both, only the VMEM varied (respectively ~780G and ~50G). Touching the second page increased the RES of both to ~816M, which is about what you'd expect.

This is on a more or less stock x64 Mint.

> I think main problems start when you actually touch more than the base allocation of the stack

The thing is you're unlikely to do that in all of your 100k routines, most of them will not grow beyond their first page, and maybe their second… at which point the routine's stack would have grown to 8k anyway.

> maybe segmented stacks grow from 200M to 300M, 500M or whatever you actually use at a given time (with say 20-60% efficiency)

Go has not used segmented stacks since 1.3.

> your C-compatible stacks might go from 500M to 3G¹ if you on average touch just 32K of stack per thread with some unlucky function

Your not-C-compatible stacks will do the exact same thing. Since 1.3, stacks are realloc'd and double in size on every overflow.

The Go runtime will also shrink stacks if able (halving them) during a GC run, but you can do essentially the same thing on your C stack using madvise(2), and without the need to copy stack data around.

So what gain there is, is really only for goroutines which never grow beyond their initial size (and only since the default stack was decreased from 8k to 2k in 1.4), at the cost of all the C incompatibility mess. And it assumes these 2k stacks are allocated from a reusable pool (which is probably a fair assumption, though I certainly did not check it), otherwise they'd be on different pages anyway.


> Simulating this .. there is no difference in RES between 8M pages and 512K pages

Linux memory management has notoriously complicated reporting. Your program only has 495M of usable memory mapped (RSS; not sure where the extra 100M is coming from), but RSS does not count page tables.

You cannot actually use the 400M of sparsely allocated memory (4K at 2+MB intervals) without another ~400M of page tables. I'd suggest you try allocating and using >50% of RAM, or just compare how much you can use sparsely vs. densely before your program hits OOM. Note that you may need to enable overcommit and increase the maximum map count if you are mapping regions separately, and preferably disable swap to avoid thrashing:

  sysctl vm.overcommit_memory=1
  sysctl vm.max_map_count=10000000
  swapoff -a
You can also watch the PageTables total in /proc/meminfo, which should show the difference.

That being said, I don't use Go; I was merely pointing out that virtual memory management is not magical, and sparse allocations have real memory costs (and quite significant other costs besides).

> The Go runtime will also shrink stacks if able (halving them) during a GC run, but you can do essentially the same thing on your C stack using madvise(2), and without the need to copy stack data around.

While you could do that, I believe it requires a syscall per stack, which might be more expensive than a bit of copying, and I'm not completely sure how easy it would be to determine whether the memory has actually been allocated and needs freeing.



