Stack buffers are allocated on the heap (by the "default" rr.c implementation of usched::s_alloc()). Allocating them on the stack, as you seem to be proposing, is a really clever idea: you could do something like rr_thread_init(&startfunc, arg, alloca(BUFFERSIZE)). In the default implementation, the buffer is freed when the coroutine terminates (rr_done()).
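To make the "buffer on the caller's stack" idea concrete, here is a minimal sketch. It is not usched code and does not use the rr.c API: it swaps in POSIX ucontext as a stand-in, and CO_STACK_SIZE, co_body and run_on_caller_stack are names made up for illustration. The point is only that the coroutine's buffer can be a local array, provided the frame that owns it outlives the coroutine.

```c
#include <ucontext.h>

/* Assumed size for this sketch; it must be large enough for whatever the body calls. */
#define CO_STACK_SIZE (64 * 1024)

static ucontext_t main_ctx, co_ctx;
static int steps;   /* number of steps the coroutine body has completed */

/* Hypothetical coroutine body: one step, a yield, then a second step. */
static void co_body(void)
{
    steps++;
    swapcontext(&co_ctx, &main_ctx);   /* yield: saves this context into co_ctx */
    steps++;
    /* Returning from the body resumes uc_link, i.e. main_ctx. */
}

/* Runs the coroutine to completion on a buffer that lives in this frame. */
int run_on_caller_stack(void)
{
    char stack[CO_STACK_SIZE];          /* on the caller's stack, not the heap */

    steps = 0;
    getcontext(&co_ctx);
    co_ctx.uc_stack.ss_sp = stack;
    co_ctx.uc_stack.ss_size = sizeof stack;
    co_ctx.uc_link = &main_ctx;
    makecontext(&co_ctx, co_body, 0);

    swapcontext(&main_ctx, &co_ctx);    /* run the body up to its yield */
    swapcontext(&main_ctx, &co_ctx);    /* resume it until it returns */
    return steps;                       /* buffer vanishes with this frame */
}
```

Nothing has to be freed here, but the coroutine must be finished before run_on_caller_stack() returns, since the buffer disappears with the frame. The same lifetime constraint would apply to an alloca()-allocated buffer.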
Thanks. It would also be good to have some description of what the numbers mean. What do you do about saving the registers at yield points? Maybe this is most easily done with compiler support.
I had been envisioning copying the stack images to a memory array rather than using alloca, though maybe alloca would also work. I see in your benchmarks that you tested this on quite a large system. I had imagined something much smaller, like an Arduino. I might try porting your benchmark to GHC, Elixir, and maybe even Forth.
A 3-lisp program is conceptually executed by an interpreter written in 3-lisp that is itself executed by an interpreter written in 3-lisp and so on ad infinitum. This forms an infinite tower of meta-circular interpreters, which can be implemented efficiently.