As explained in eigenstate.org and in SECCOMP (2):
The only system calls that the calling thread is permitted to
make are read(2), write(2), _exit(2) (but not exit_group(2)),
and sigreturn(2). Other system calls result in the delivery
of a SIGKILL signal.
As a result, one would expect _exit()
to work, but it's a wrapper function that invokes exit_group(2)
which is not allowed in strict mode ([1], [2]), thus the process gets killed.
It's even reported in exit(2) - Linux man page:
In glibc up to version 2.3, the _exit() wrapper function invoked the kernel system call of the same name. Since glibc 2.3, the wrapper function invokes exit_group(2), in order to terminate all of the threads in a process.
Same happens with the return
statement, which should end up in killing your process, in the very similar manner with _exit()
.
Stracing the process will provide further confirmation (to allow this to show up, you have to not set PR_SET_SECCOMP; just comment prctl()
) and I got similar output for both non-working cases:
linux12:/home/users/grad1459>gcc seccomp.c -o seccomp
linux12:/home/users/grad1459>strace ./seccomp
execve("./seccomp", ["./seccomp"], [/* 24 vars */]) = 0
brk(0) = 0x8784000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb775f000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=97472, ...}) = 0
mmap2(NULL, 97472, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7747000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/i386-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "177ELF1113312202261004"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=1730024, ...}) = 0
mmap2(NULL, 1739484, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xdd0000
mmap2(0xf73000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1a3) = 0xf73000
mmap2(0xf76000, 10972, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xf76000
close(3) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7746000
set_thread_area({entry_number:-1 -> 6, base_addr:0xb7746900, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}) = 0
mprotect(0xf73000, 8192, PROT_READ) = 0
mprotect(0x8049000, 4096, PROT_READ) = 0
mprotect(0x16e000, 4096, PROT_READ) = 0
munmap(0xb7747000, 97472) = 0
exit_group(0) = ?
linux12:/home/users/grad1459>
As you can see, exit_group()
is called, explaining everything!
Now as you correctly stated, "SYS_exit equals __NR_exit
"; for example it's defined in mit.syscall.h:
#define SYS_exit __NR_exit
so the last two calls are equivalent, i.e. you can use the one you like, and the output should be this:
linux12:/home/users/grad1459>gcc seccomp.c -o seccomp && ./seccomp ; echo "${?}"
0
PS
You could of course define a filter
yourself and use:
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, filter);
as explained in the eigenstate link, to allow _exit()
(or, strictly speaking, exit_group(2)
), but do that only if you really need to and know what you are doing.