https://marc.info/?l=openbsd-tech&m=157488907117170&w=2

For dynamic binaries, we continue to permit the main program exec segment because "go" (and potentially a few other applications) have embedded system calls in the main program. Hopefully at least go gets fixed soon.
We declare the concept of embedded syscalls a bad idea for numerous reasons, as we notice the ecosystem has many static-syscall-in-base-binary programs which are dynamically linked against libraries which in turn use libc, which contains another set of syscall stubs. We've been concerned about adding even one additional syscall entry point... but go's approach tends to double the entry-point attack surface.
[edit for convenience of readers - read the above linked thread - I just grabbed the go part]
Unfortunately our current go build model hasn't followed the solaris/macos approach yet of calling libc stubs, and uses the inappropriate "embed system calls directly" method, so for now we'll need to authorize the main program text as well. A comment in exec_elf.c explains this.
If go is adapted to call library-based system call stubs on OpenBSD as well, this problem will go away. There may be other environments creating raw system calls. I guess we'll need to find them as time goes by, and hope in time we can repair those also.
> We've been concerned about adding even one additional syscall entry point
I don't understand the need for such a severe "only libc syscalls ever" approach.
What would be the security concern with allowing syscalls only from preauthorized (ie msyscall(2)) regions, making initial region authorization opt-in (instead of opt-out), allowing the program to call msyscall(2) itself, and rejecting any statically linked (ie non-ASLR'd) regions for authorization?
> I don't understand the need for such a severe "only libc syscalls ever" approach.
There's nothing severe about it. Most systems are exactly that: systems, of which the kernel is only one part. Syscalls are rarely if ever intended to be called directly, willy-nilly.
The issue is that, unlike Windows, Unices have never enforced this.
It makes sense for systems where libc is tightly coupled and co-versioned with the kernel, e.g. the BSDs, but Linux always relied on third-party C libraries, supported static binaries, etc.
You could argue that BSD made the mistake of intending to have a Windows-style C library compat guarantee but not enforcing it, but that was not in scope for Linux. The philosophy has always been syscall-level compat (and there are lots of famous threads with Linus reinforcing this to others who would presume that things should be “fixed in user space”).
So it’s hardly reasonable to generalize based on some BSD concerns; Linux is WAI and represents the most common Unix-like system people use today by far.
There’s a pretty good argument that this level of compat, while the source of some problems, has also made other things much easier: consider container images that are bundled with their own system libraries. (You could certainly invent schemes to inject these libraries, but dealing with link and library level compatibility seems even more complex to deal with than system call-level compatibility.)
Darwin/macOS has the same rules as Windows and the BSDs (syscalls are private API), and it's extremely popular due to iOS. Linux is in fact the odd one out here.
There is a difference though: libSystem on Darwin is a very thin wrapper over the kernel syscalls. Libc, on the contrary, is a library that was designed for C, then standardized in POSIX, and has several layers of abstraction over kernel syscalls, including many bad defaults that are universally recognized as wrong today (e.g. file descriptors created through libc are all inheritable by default).
Go isn't obligated to use any libc APIs or abstractions other than those providing syscalls.
You're incorrect, or maybe just misleading, about libc-created file descriptors inheriting as stated. Either way, it is unrelated to using libc for syscalls versus bare machine traps.
I think it's mandated by the POSIX standard; but even if I'm wrong, there's still the problem that libc doesn't let you do that atomically, for instance. In general, it's an old interface that doesn't expose the full power of all modern syscalls.
Yeah, libc's syscall wrappers just do what you tell them. If you don't pass O_CLOEXEC to the kernel syscalls, you get the inherit behavior. Libc's syscall wrappers don't change this in any way.
To the extent that Go's default for file descriptors today is !inherit (I'm unfamiliar, but if so, it's a good choice), the Go runtime must already add O_CLOEXEC to bare syscalls. There's no reason to believe it incapable of adding the flag to libc syscalls instead.
You are thinking of the older way, where fcntl(fd, F_SETFD, FD_CLOEXEC) must be used after open(), leaving a short window in which the file descriptor may be inherited.
The newer way passes the O_CLOEXEC to open() and there is no fcntl() call. This is atomic with respect to inheritability: The kernel returns a non-inheritable file descriptor to libc, and libc returns it to the application.
Other syscalls that return a file descriptor have similar flags, so they are atomic too.
These flags and behaviours are exactly the same, whether done by calling through libc as most programs do, or direct kernel syscalls bypassing libc, as Go and a few other programs do.
Unfortunately, you misunderstand how CLOEXEC works and how the Go runtime implements the feature you think libc lacks.
This syscall level behavior is POSIX-specified[1] since at least the 2008 edition[2]:
> O_CLOEXEC
> If set, the FD_CLOEXEC flag for the new file descriptor shall be set.
What that means is, any C program or Go program that passes the O_CLOEXEC flag to open(2) on a POSIX 2008 conforming system (including Linux and the BSDs, for example), will atomically create the fd without inherit behavior. There is no "short window" and hasn't been for more than a decade. The Go runtime must use that flag to provide that property; there is no other way on these systems. Libc users are of course able to use the same flag.
I mean, theoretically (i.e. I have no idea if anything does this), you could have underlying system file descriptors, which did not inherit, then a mapping from "libc" file descriptors onto OS ones and some code in fork() to copy over any OS file descriptors that are exposed to libc-using code.
The mapping is the identity function, and whether a descriptor is inheritable or not is just a function of O_CLOEXEC / fcntl(FD_CLOEXEC) / a few special fd types that are always cloexec, such as kqueues on FreeBSD. Libc fds aren't special to the operating system in any way.
The plan for containers on Solaris, after rejecting injecting libc from the host, was to have users rebuild all containers after OS upgrade. Windows has to virtualise containers with an incompatible OS version. It is definitely less convenient than Linux.
OpenBSD does have somewhat different constraints and they seem to think this will work for them.
> The plan for containers on Solaris, after rejecting injecting libc from the host, was to have users rebuild all containers after OS upgrade.
What containers are you referring to? Because this is definitely not how Zones work on Solaris.
There are two types of Zones in Solaris 11; "Kernel Zones" which run their own independent version of Solaris and "non-global Zones" which are automatically kept at the same version as the host.
> Windows has to virtualise containers with an incompatible OS version.
Not as far as I'm aware. Windows Sandboxes don't work that way, nor do other technologies I'm aware of. What are you referring to?
The iOS Simulator does something along these lines for macOS. When macOS is updated CoreSimulator rebuilds the dyld_sim shared caches because simulator runtimes pull in libsystem_kernel, libsystem_pthread, and libsystem_platform (which cover the core of the kernel’s ABI).
> It makes sense for systems where libc is tightly coupled and coversioned with the kernel, e.g. BSDs, but Linux always relied on third-party C libraries and supported static binaries, etc.
Linux is not a system. Linux is a kernel; Linux has distributions. It is not a single coherent system where the kernel and standard library are co-developed. That's the entire point I'm making.
> So it’s hardly reasonable to generalize based on some BSD concerns
It's not "some BSD concerns", it's pretty much every non-Linux Unix. What's not reasonable is generalising "syscalls are a perfectly fine interface", which is almost exclusively a Linux thing. Or don't claim compatibility with anything other than Linux; that's also a perfectly fine choice.
> There’s a pretty good argument that this level of compat, while the source of some problems, has also made other things much easier: consider container images that are bundled with their own system libraries.
Last time I checked, Go did not run exclusively on Linux. If it did, raw syscalls would indeed not be a concern (though even then they try to have their cake and eat it, as e.g. they want to do raw syscalls yet benefit from the vDSO, which has been an issue in the past because their assumptions did not hold: https://marcan.st/2017/12/debugging-an-evil-go-runtime-bug/)
Perhaps "strict" would have been a better word choice than "severe".
It does seem that there's a very reasonable security concern about doing so from +w+x memory, or from non-PIE regions, or without first explicitly authorizing the calling range.
My question still stands regarding what the security concern of opt-in PIE -w+x code making direct syscalls is.
Edit: (Of course I do understand that the BSDs (unlike Linux) do not guarantee a stable syscall ABI and as such performing them directly is strictly a bad practice.)
Windows did not enforce this either, and despite how bad of an idea it is, there has been software that does its own syscalls. Mostly tricky things like DRM or anti-cheat.
Still, I don’t think there’s anything wrong with letting an application mark part of its code safe for syscall execution, versus enforcing libc only. Seems like the exact same thing as the execute bit. Moreover, some systems genuinely have a stable syscall ABI - I think Linux would be considered one.
Windows pretty much enforces it in the sense that syscall numbers can change between minor updates, so raw syscalls break extremely often.
> Moreover, some systems genuinely have a stable syscall ABI - I think Linux would be considered one.
Linux is not a system. It's a kernel, with userlands you can bolt on. That's why it has a stable syscall ABI: that's the only interface Linux can have if it intends to provide an interface.
Linux is basically the only system that targets a stable syscall ABI, and that's basically due to being loosely coupled with all other parts of a Linux-based operating system.
> syscalls are rarely if ever intended to be called directly nilly-willy.
I don't think this is true, at least on Linux.
On Linux, the system call interface¹ is the documented interface to user space. Even the commonly used vDSO² is a stable interface. This is important because it means the popular C libraries are not part of the Linux kernel interface. Although glibc is often portrayed³ as some kind of Linux kernel wrapper, they are entirely separate projects. Linux manuals⁴ also make it seem like they are one and the same:
> The Linux man-pages project documents the Linux kernel and C library interfaces that are employed by user-space programs.
These same manuals also document systemd as if it was part of Linux. I went there expecting low level documentation useful for writing one's own init system and got systemd documentation instead. It's very confusing in my opinion. Why are external projects documented in the Linux manuals?
Anyway, these kernel features are used by C libraries to implement all their functions. Using C libraries is the traditional way to build a Linux user space but it is certainly not the only way. Compilers could emit these system calls directly, avoiding the need for a runtime library. A programming language virtual machine could be built directly on top of Linux system calls. It is possible to create freestanding programs that run on Linux with zero dependencies.
Incompatibilities are caused by these user space libraries, not by the system calls themselves. For example, glibc maintains a lot of thread local state and will not work correctly if the program calls clone(). A program that does not link to glibc does not have this limitation.
Although low level, Linux system calls are in many ways a simpler interface: their behavior is more precisely documented compared to POSIX; there is no need to deal with errno; there is no hidden C library functionality that's hard to understand; freestanding programs do not contain references to hundreds of hidden standard library symbols that implement obscure functionality.
The kernel itself contains a nolibc.h header[5] that it apparently uses for its own tools.
Correct. So now, if you want to allow any program, and presumably its libraries, to register their regions of memory as being syscall-worthy, you need to lift the called-once restrictions, which opens a hole for an attacker.
> you need to lift the called-once restrictions, which opens a hole for an attacker
I don't see how that's the case; my original question was essentially asking what that security hole is. Provided that syscall ability is off by default and that only code subject to ASLR is permitted to be authorized, it doesn't seem terribly risky to allow additional such regions to be registered. An exploit has to contend with ASLR either way: either by locating libc, or by locating some other authorized region within the current process.
ASLR isn't a complete solution. It's not that hard to find libc, so this is just another hurdle, not a full barrier. You're proposing weakening the barrier.
I'm not proposing anything, I'm asking for a concrete explanation of the supposed security hole. I agree that ASLR isn't a complete security solution and never implied otherwise.
AFAIU, the entire security benefit here is due to ASLR alone. If an exploit manages to track down libc, it can go right ahead and make all the system calls it wants. (Unless there's some other piece to the puzzle that I've missed? Is there something special about libc in particular?) As such, I still don't understand how the called-once restriction is supposed to meaningfully increase security - by the time you've found the msyscall() function, you've also found _all the others_ anyway.
> AFAIU, the entire security benefit here is due to ASLR alone. If an exploit manages to track down libc, it can go right ahead and make all the system calls it wants
It has to create the appropriate gadgets to generate function call sequences, and generating gadgets is hard.
That recent change in OpenBSD is indeed interesting, however, this doesn't have much to do with how go handles scheduling of goroutines, other than the fact that the words "go" or "syscall" appear in both places.
> Unfortunately our current go build model hasn't followed solaris/macos approach yet of calling libc stubs, and uses the inappropriate "embed system calls directly" method,
Go, as of version 1.12, uses libSystem on Darwin to make syscalls.
Does this mean that they want to ban syscalls from everything but approved "fat client" libraries like libc? (And perhaps ban versions of libc that have bugs?) How is that implemented? I guess it's by only allowing syscalls if the calling code is in a special part of memory, and the OS can gatekeep access to that memory?
My understanding is that system calls can currently only be made from -w+x regions; attempting otherwise results in the process being killed.
The idea is to extend this protection to only allow system calls from expected address ranges, so that a successful exploit can't simply make raw calls but instead has to track down an existing authorized one (and thus contend with ASLR). To that end, the new call-once syscall msyscall(2) is added. The linker uses it to register libc.so with the kernel after randomly mapping it into the current process.
For dynamic binaries, we continue to permit the main program exec segment because "go" (and potentially a few other applications) have embedded system calls in the main program. Hopefully at least go gets fixed soon.
We declare the concept of embedded syscalls a bad idea for numerous reasons, as we notice the ecosystem has many static-syscall-in-base-binary programs which are dynamically linked against libraries which in turn use libc, which contains another set of syscall stubs. We've been concerned about adding even one additional syscall entry point... but go's approach tends to double the entry-point attack surface.
https://marc.info/?l=openbsd-tech&m=157488907117170&w=2
[edit for convenience of readers - read the above linked thread - I just grabbed the go part]
Unfortunately our current go build model hasn't followed the solaris/macos approach yet of calling libc stubs, and uses the inappropriate "embed system calls directly" method, so for now we'll need to authorize the main program text as well. A comment in exec_elf.c explains this.
If go is adapted to call library-based system call stubs on OpenBSD as well, this problem will go away. There may be other environments creating raw system calls. I guess we'll need to find them as time goes by, and hope in time we can repair those also.
[/edit]