When seccomp_restrict_archs is called, architectures that are blocked
are replaced by the SECCOMP_LOCAL_ARCH_BLOCKED marker so that they are
not disabled again and filters are not installed for them.
This can make some service that use SystemCallArchitecture= and
SystemCallFilter= start faster.
Define explicit action "kill" for SystemCallErrorNumber=.
In addition to errno code, allow specifying "kill" as action for
SystemCallFilter=.
---
v7: seccomp_parse_errno_or_action() returns -EINVAL if !HAVE_SECCOMP
v6: use streq_ptr(), let errno_to_name() handle bad values, kill processes,
init syscall_errno
v5: actually use seccomp_errno_or_action_to_string(), don't fail bus unit
parsing without seccomp
v4: fix build without seccomp
v3: drop log action
v2: action -> number
https://tools.ietf.org/html/draft-knodel-terminology-02https://lwn.net/Articles/823224/
This gets rid of most but not occasions of these loaded terms:
1. scsi_id and friends are something that is supposed to be removed from
our tree (see #7594)
2. The test suite defines an API used by the ubuntu CI. We can remove
this too later, but this needs to be done in sync with the ubuntu CI.
3. In some cases the terms are part of APIs we call or where we expose
concepts the kernel names the way it names them. (In particular all
remaining uses of the word "slave" in our codebase are like this,
it's used by the POSIX PTY layer, by the network subsystem, the mount
API and the block device subsystem). Getting rid of the term in these
contexts would mean doing some major fixes of the kernel ABI first.
Regarding the replacements: when whitelist/blacklist is used as noun we
replace with with allow list/deny list, and when used as verb with
allow-list/deny-list.
Patch contains a coccinelle script, but it only works in some cases. Many
parts were converted by hand.
Note: I did not fix errors in return value handing. This will be done separate
to keep the patch comprehensible. No functional change is intended in this
patch.
We noticed in our tests that occasionally SystemCallFilter= would
fail to set and the service would run with no syscall filtering.
Most of the time the same tests would apply the filter and fail
the service as expected. While it's not totally clear why this happens,
we noticed seccomp_load() in the systemd code base would fail open for
all errors except EPERM and EACCES.
ENOMEM, EINVAL, and EFAULT seem like reasonable values to add to the
error set based on what I gather from libseccomp code and man pages:
-ENOMEM: out of memory, failed to allocate space for a libseccomp structure, or would exceed a defined constant
-EINVAL: kernel isn't configured to support the operations, args are invalid (to seccomp_load(), seccomp(), or prctl())
-EFAULT: addresses passed as args are invalid
Building on previous commit, let's pass the unit name when parsing
dbus message or builtin whitelist, which is better than nothing.
seccomp_parse_syscall_filter() is not needed anymore, so it is removed,
and seccomp_parse_syscall_filter_full() is renamed to take its place.
Let services use a private UTS namespace. In addition, a seccomp filter is
installed on set{host,domain}name and a ro bind mounts on
/proc/sys/kernel/{host,domain}name.
Our logs are full of:
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call oldstat() / -10037, ignoring: Numerical argument out of domain
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call get_thread_area() / -10076, ignoring: Numerical argument out of domain
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call set_thread_area() / -10079, ignoring: Numerical argument out of domain
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call oldfstat() / -10034, ignoring: Numerical argument out of domain
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call oldolduname() / -10036, ignoring: Numerical argument out of domain
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call oldlstat() / -10035, ignoring: Numerical argument out of domain
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call waitpid() / -10073, ignoring: Numerical argument out of domain
...
This is pointless and makes debug logs hard to read. Let's keep the logs
in test code, but disable it in nspawn and pid1. This is done through a function
parameter because those functions operate recursively and it's not possible to
make the caller to log meaningfully.
There should be no functional change, except the skipped debug logs.
Currently we employ mostly system call blacklisting for our system
services. Let's add a new system call filter group @system-service that
helps turning this around into a whitelist by default.
The new group is very similar to nspawn's default filter list, but in
some ways more restricted (as sethostname() and suchlike shouldn't be
available to most system services just like that) and in others more
relaxed (for example @keyring is blocked in nspawn since it's not
properly virtualized yet in the kernel, but is fine for regular system
services).
These lines are generally out-of-date, incomplete and unnecessary. With
SPDX and git repository much more accurate and fine grained information
about licensing and authorship is available, hence let's drop the
per-file copyright notice. Of course, removing copyright lines of others
is problematic, hence this commit only removes my own lines and leaves
all others untouched. It might be nicer if sooner or later those could
go away too, making git the only and accurate source of authorship
information.
This part of the copyright blurb stems from the GPL use recommendations:
https://www.gnu.org/licenses/gpl-howto.en.html
The concept appears to originate in times where version control was per
file, instead of per tree, and was a way to glue the files together.
Ultimately, we nowadays don't live in that world anymore, and this
information is entirely useless anyway, as people are very welcome to
copy these files into any projects they like, and they shouldn't have to
change bits that are part of our copyright header for that.
hence, let's just get rid of this old cruft, and shorten our codebase a
bit.
Let's always write "1 << 0", "1 << 1" and so on, except where we need
more than 31 flag bits, where we write "UINT64(1) << 0", and so on to force
64bit values.
Files which are installed as-is (any .service and other unit files, .conf
files, .policy files, etc), are left as is. My assumption is that SPDX
identifiers are not yet that well known, so it's better to retain the
extended header to avoid any doubt.
I also kept any copyright lines. We can probably remove them, but it'd nice to
obtain explicit acks from all involved authors before doing that.
This reworks system call filter parsing, and replaces a couple of "bool"
function arguments by a single flags parameter.
This shouldn't change behaviour, except for one case: when we
recursively call our parsing function on our own syscall list, then
we'll lower the log level to LOG_DEBUG from LOG_WARNING, because at that
point things are just a problem in our own code rather than in the user
configuration we are parsing, and we shouldn't hence generate confusing
warnings about syntax errors.
Fixes: #8261
If multiple SystemCallFilter= settings, some of them are whitelist
and the others are blacklist, are sent to bus, then the parse
result was corrupted.
This fixes the parse logic, now it is the same as one used in
load-fragment.c
This makes each system call in SystemCallFilter= blacklist optionally
takes errno name or number after a colon. The errno takes precedence
over the one given by SystemCallErrorNumber=.
C.f. #7173.
Closes#7169.
This removes the '@credentials' syscall set that was added in commit
v234-468-gcd0ddf6f75.
Most of these syscalls are so simple that we do not want to filter them.
They work on the current calling process, doing only read operations,
they do not have a deep kernel path.
The problem may only be in 'capget' syscall since it can query arbitrary
processes, and used to discover processes, however sending signal 0 to
arbitrary processes can be used to discover if a process exists or not.
It is unfortunate that Linux allows to query processes of different
users. Lets put it now in '@process' syscall set, and later we may add
it to a new '@basic-process' set that allows most basic process
operations.
Now that we have ported nspawn's seccomp code to the generic code in
seccomp-util, let's extend it to support whitelisting and blacklisting
of specific additional syscalls.
This uses similar syntax as PID1's support for system call filtering,
but in contrast to that always implements a blacklist (and not a
whitelist), as we prepopulate the filter with a blacklist, and the
unit's system call filter logic does not come with anything
prepopulated.
(Later on we might actually want to invert the logic here, and
whitelist rather than blacklist things, but at this point let's not do
that. In case we switch this over later, the syscall add/remove logic of
this commit should be compatible conceptually.)
Fixes: #5163
Replaces: #5944
Let's add a new helper function seccomp_add_syscall_filter_item() that
contains the inner loop code of seccomp_add_syscall_filter_set(). This
helper function we can then export and make use of elsewhere.
Add LockPersonality boolean to allow locking down personality(2)
system call so that the execution domain can't be changed.
This may be useful to improve security because odd emulations
may be poorly tested and source of vulnerabilities, while
system services shouldn't need any weird personalities.
This patch adds two new special character prefixes to ExecStart= and
friends, in addition to the existing "-", "@" and "+":
"!" → much like "+", except with a much reduced effect as it only
disables the actual setresuid()/setresgid()/setgroups() calls, but
leaves all other security features on, including namespace
options. This is very useful in combination with
RuntimeDirectory= or DynamicUser= and similar option, as a user
is still allocated and used for the runtime directory, but the
actual UID/GID dropping is left to the daemon process itself.
This should make RuntimeDirectory= a lot more useful for daemons
which insist on doing their own privilege dropping.
"!!" → Similar to "!", but on systems supporting ambient caps this
becomes a NOP. This makes it relatively straightforward to write
unit files that make use of ambient capabilities to let systemd
drop all privs while retaining compatibility with systems that
lack ambient caps, where priv dropping is the left to the daemon
codes themselves.
This is an alternative approach to #6564 and related PRs.
This new group lists all UID/GID credential changing syscalls (which are
quite a number these days). This will become particularly useful in a
later commit, which uses this group to optionally permit user credential
changing to daemons in case ambient capabilities are not available.
We expect that if socket() syscall is available, seccomp works for that
architecture. So instead of explicitly listing all architectures where we know
it is not available, just assume it is broken if the number is not defined.
This should have the same effect, except that other architectures where it is
also broken will pass tests without further changes. (Architectures where the
filter should work, but does not work because of missing entries in
seccomp-util.c, will still fail.)
i386, s390, s390x are the exception — setting the filter fails, even though
socket() is available, so it needs to be special-cased
(https://github.com/systemd/systemd/issues/5215#issuecomment-277241488).
This remove the last define in seccomp-util.h that was only used in test-seccomp.c. Porting
the seccomp filter to new architectures should be simpler because now only two places need
to be modified.
RestrictAddressFamilies seems to work on ppc64[bl]e, so enable it (the tests pass).
While adding the defines for arm, I realized that we have pretty much all
known architectures covered, so SECCOMP_RESTRICT_NAMESPACES_BROKEN is not
necessary anymore. clone(2) is adamant that the order of the first two
arguments is only reversed on s390/s390x. So let's simplify things and remove
the #if.
SECCOMP_MEMORY_DENY_WRITE_EXECUTE_BROKEN was conflating two separate things:
1. whether shmat/shmdt/shmget can be filtered (if ipc multiplexer is used, they can not)
2. whether we know this for the current architecture
For i386, shmat is implemented as ipc, so seccomp filter is "broken" for shmat,
but not for mmap, and SECCOMP_MEMORY_DENY_WRITE_EXECUTE_BROKEN cannot be used
to cover both cases. The define was only used for tests — not in the implementation
in seccomp-util.c. So let's get rid of SECCOMP_MEMORY_DENY_WRITE_EXECUTE_BROKEN
and encode the right condition directly in tests.
Also updates the documentation and adds a mention of ppc64 support
which was enabled by #5325.
Tested on Debian mipsel and mips64el. The other 4 mips architectures
should have an identical user <-> kernel ABI to one of the 2 tested
systems.
Add a bit of code that tries to get the right parameter order in place
for some of the better known architectures, and skips
restrict_namespaces for other archs.
This also bypasses the test on archs where we don't know the right
order.
In this case I didn't bother with testing the case where no filter is
applied, since that is hopefully just an issue for now, as there's
nothing stopping us from supporting more archs, we just need to know
which order is right.
Fixes: #5241
On i386 we block the old mmap() call entirely, since we cannot properly
filter it. Thankfully it hasn't been used by glibc since quite some
time.
Fixes: #5240
This substantially reworks the seccomp code, to ensure better
compatibility with some architectures, including i386.
So far we relied on libseccomp's internal handling of the multiple
syscall ABIs supported on Linux. This is problematic however, as it does
not define clear semantics if an ABI is not able to support specific
seccomp rules we install.
This rework hence changes a couple of things:
- We no longer use seccomp_rule_add(), but only
seccomp_rule_add_exact(), and fail the installation of a filter if the
architecture doesn't support it.
- We no longer rely on adding multiple syscall architectures to a single filter,
but instead install a separate filter for each syscall architecture
supported. This way, we can install a strict filter for x86-64, while
permitting a less strict filter for i386.
- All high-level filter additions are now moved from execute.c to
seccomp-util.c, so that we can test them independently of the service
execution logic.
- Tests have been added for all types of our seccomp filters.
- SystemCallFilters= and SystemCallArchitectures= are now implemented in
independent filters and installation logic, as they semantically are
very much independent of each other.
Fixes: #4575
@filesystem groups various file system operations, such as opening files and
directories for read/write and stat()ing them, plus renaming, deleting,
symlinking, hardlinking.
This new setting permits restricting whether namespaces may be created and
managed by processes started by a unit. It installs a seccomp filter blocking
certain invocations of unshare(), clone() and setns().
RestrictNamespaces=no is the default, and does not restrict namespaces in any
way. RestrictNamespaces=yes takes away the ability to create or manage any kind
of namspace. "RestrictNamespaces=mnt ipc" restricts the creation of namespaces
so that only mount and IPC namespaces may be created/managed, but no other
kind of namespaces.
This setting should be improve security quite a bit as in particular user
namespacing was a major source of CVEs in the kernel in the past, and is
accessible to unprivileged processes. With this setting the entire attack
surface may be removed for system services that do not make use of namespaces.
@resources contains various syscalls that alter resource limits and memory and
scheduling parameters of processes. As such they are good candidates to block
for most services.
@basic-io contains a number of basic syscalls for I/O, similar to the list
seccomp v1 permitted but slightly more complete. It should be useful for
building basic whitelisting for minimal sandboxes