diff --git a/man/systemd.exec.xml b/man/systemd.exec.xml index 8963764bf6..bacd539b15 100644 --- a/man/systemd.exec.xml +++ b/man/systemd.exec.xml @@ -1540,24 +1540,29 @@ RestrictNamespaces=~cgroup net SystemCallFilter= - Takes a space-separated list of system call names. If this setting is used, all system calls - executed by the unit processes except for the listed ones will result in immediate process termination with the - SIGSYS signal (whitelisting). If the first character of the list is ~, - the effect is inverted: only the listed system calls will result in immediate process termination - (blacklisting). Blacklisted system calls and system call groups may optionally be suffixed with a colon - (:) and errno error number (between 0 and 4095) or errno name such as - EPERM, EACCES or EUCLEAN. This value will be - returned when a blacklisted system call is triggered, instead of terminating the processes immediately. This - value takes precedence over the one given in SystemCallErrorNumber=. If running in user - mode, or in system mode, but without the CAP_SYS_ADMIN capability (e.g. setting - User=nobody), NoNewPrivileges=yes is implied. This feature makes use of - the Secure Computing Mode 2 interfaces of the kernel ('seccomp filtering') and is useful for enforcing a - minimal sandboxing environment. Note that the execve, exit, - exit_group, getrlimit, rt_sigreturn, - sigreturn system calls and the system calls for querying time and sleeping are implicitly - whitelisted and do not need to be listed explicitly. This option may be specified more than once, in which case - the filter masks are merged. If the empty string is assigned, the filter is reset, all prior assignments will - have no effect. This does not affect commands prefixed with +. + Takes a space-separated list of system call names. 
If this setting is used, all + system calls executed by the unit processes except for the listed ones will result in immediate + process termination with the SIGSYS signal (whitelisting). (See + SystemCallErrorNumber= below for changing the default action). If the first + character of the list is ~, the effect is inverted: only the listed system calls + will result in immediate process termination (blacklisting). Blacklisted system calls and system call + groups may optionally be suffixed with a colon (:) and errno + error number (between 0 and 4095) or errno name such as EPERM, + EACCES or EUCLEAN (see errno(3) for a + full list). This value will be returned when a blacklisted system call is triggered, instead of + terminating the processes immediately. This value takes precedence over the one given in + SystemCallErrorNumber=, see below. If running in user mode, or in system mode, + but without the CAP_SYS_ADMIN capability (e.g. setting + User=nobody), NoNewPrivileges=yes is implied. This feature + makes use of the Secure Computing Mode 2 interfaces of the kernel ('seccomp filtering') and is useful + for enforcing a minimal sandboxing environment. Note that the execve, + exit, exit_group, getrlimit, + rt_sigreturn, sigreturn system calls and the system calls + for querying time and sleeping are implicitly whitelisted and do not need to be listed + explicitly. This option may be specified more than once, in which case the filter masks are + merged. If the empty string is assigned, the filter is reset, all prior assignments will have no + effect. This does not affect commands prefixed with +. 
Note that on systems supporting multiple ABIs (such as x86/x86-64) it is recommended to turn off alternative ABIs for services, so that they cannot be used to circumvent the restrictions of this @@ -1717,6 +1722,22 @@ RestrictNamespaces=~cgroup net SystemCallFilter=@system-service SystemCallErrorNumber=EPERM + Note that various kernel system calls are defined redundantly: there are multiple system calls + for executing the same operation. For example, the pidfd_send_signal() system + call may be used to execute operations similar to what can be done with the older + kill() system call, hence blocking the latter without the former only provides + weak protection. Since new system calls are added regularly to the kernel as development progresses, + keeping system call blacklists comprehensive requires constant work. It is thus recommended to use + whitelisting instead, which offers the benefit that new system calls are by default implicitly + blocked until the whitelist is updated. + + Also note that a number of system calls are required to be accessible for the dynamic linker to + work. The dynamic linker is required for running most regular programs (specifically: all dynamic ELF + binaries, which is how most distributions build packaged programs). This means that blocking these + system calls (which include open(), openat() or + mmap()) will make most programs typically shipped with generic distributions + unusable. + It is recommended to combine the file system namespacing related options with SystemCallFilter=~@mount, in order to prohibit the unit's processes to undo the mappings. Specifically these are the options PrivateTmp=, @@ -1729,11 +1750,13 @@ SystemCallErrorNumber=EPERM SystemCallErrorNumber= - Takes an errno error number (between 1 and 4095) or errno name such as - EPERM, EACCES or EUCLEAN, to return when the - system call filter configured with SystemCallFilter= is triggered, instead of terminating - the process immediately. 
When this setting is not used, or when the empty string is assigned, the process will - be terminated immediately when the filter is triggered. + Takes an errno error number (between 1 and 4095) or errno name + such as EPERM, EACCES or EUCLEAN, to + return when the system call filter configured with SystemCallFilter= is triggered, + instead of terminating the process immediately. See errno(3) for a + full list of error codes. When this setting is not used, or when the empty string is assigned, the + process will be terminated immediately when the filter is triggered.