Let services use a private UTS namespace. In addition, a seccomp filter is
installed on set{host,domain}name and a ro bind mounts on
/proc/sys/kernel/{host,domain}name.
Previously, this system call was included in @system-service since it is
a "getter" only, i.e. only queries information, and doesn't change
anything, and hence was considered not risky.
However, as it turns out, mincore() is actually security sensitive, see
the discussion here:
https://lwn.net/Articles/776034/
Hence, let's adjust the system call filter and drop mincore() from it.
This constitues a compatibility break to some level, however I presume
we can get away with this as the systemcall is pretty exotic. The fact
that it is pretty exotic is also reflected by the fact that the kernel
intends to majorly change behaviour of the system call soon (see the
linked LWN article)
it's already part of @ipc, no need to have it in both. Given that @ipc
is much more popular (as it is part of @system-service for example),
let's not define it a second time.
Ideally, coccinelle would strip unnecessary braces too. But I do not see any
option in coccinelle for this, so instead, I edited the patch text using
search&replace to remove the braces. Unfortunately this is not fully automatic,
in particular it didn't deal well with if-else-if-else blocks and ifdefs, so
there is an increased likelikehood be some bugs in such spots.
I also removed part of the patch that coccinelle generated for udev, where we
returns -1 for failure. This should be fixed independently.
In seccomp code, the code is changed to propagate errors which are about
anything other than unknown/unimplemented syscalls. I *think* such errors
should not happen in normal usage, but so far we would summarilly ignore all
errors, so that part is uncertain. If it turns out that other errors occur and
should be ignored, this should be added later.
In nspawn, we would count the number of added filters, but didn't use this for
anything. Drop that part.
The comments suggested that seccomp_add_syscall_filter_item() returned negative
if the syscall is unknown, but this wasn't true: it returns 0.
The error at this point can only be if the syscall was known but couldn't be
added. If the error comes from our internal whitelist in nspawn, treat this as
error, because it means that our internal table is wrong. If the error comes
from user arguments, warn and ignore. (If some syscall is not known at current
architecture, it is still silently ignored.)
Our logs are full of:
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call oldstat() / -10037, ignoring: Numerical argument out of domain
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call get_thread_area() / -10076, ignoring: Numerical argument out of domain
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call set_thread_area() / -10079, ignoring: Numerical argument out of domain
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call oldfstat() / -10034, ignoring: Numerical argument out of domain
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call oldolduname() / -10036, ignoring: Numerical argument out of domain
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call oldlstat() / -10035, ignoring: Numerical argument out of domain
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call waitpid() / -10073, ignoring: Numerical argument out of domain
...
This is pointless and makes debug logs hard to read. Let's keep the logs
in test code, but disable it in nspawn and pid1. This is done through a function
parameter because those functions operate recursively and it's not possible to
make the caller to log meaningfully.
There should be no functional change, except the skipped debug logs.
If more than one errno is specified for a syscall in SystemCallFilter=,
use the last one instead of reporting an error. This is especially
useful when used with system call sets:
SystemCallFilter=@privileged:EPERM @reboot
This will block any system call requiring super-user capabilities with
EPERM, except for attempts to reboot the system, which will immediately
terminate the process. (@reboot is included in @privileged.)
This also effectively fixes#9939, since specifying different errnos for
“the same syscall” (same pseudo syscall number) is no longer an error.
Only report OOM if that was actually the error of the operation,
explicitly report the possible error that a syscall was already blocked
with a different errno and translate that into a more sensible errno
(EEXIST only makes sense in connection to the hashmap), and pass through
all other potential errors unmodified. Part of #9939.
There are some modern programming languages use userspace context switches
to implement coroutine features. PowerPC (32-bit) needs syscall "swapcontext" to get
contexts or switch between contexts, which is special.
Adding this rule should fix#9485.
Currently we employ mostly system call blacklisting for our system
services. Let's add a new system call filter group @system-service that
helps turning this around into a whitelist by default.
The new group is very similar to nspawn's default filter list, but in
some ways more restricted (as sethostname() and suchlike shouldn't be
available to most system services just like that) and in others more
relaxed (for example @keyring is blocked in nspawn since it's not
properly virtualized yet in the kernel, but is fine for regular system
services).
These lines are generally out-of-date, incomplete and unnecessary. With
SPDX and git repository much more accurate and fine grained information
about licensing and authorship is available, hence let's drop the
per-file copyright notice. Of course, removing copyright lines of others
is problematic, hence this commit only removes my own lines and leaves
all others untouched. It might be nicer if sooner or later those could
go away too, making git the only and accurate source of authorship
information.
This part of the copyright blurb stems from the GPL use recommendations:
https://www.gnu.org/licenses/gpl-howto.en.html
The concept appears to originate in times where version control was per
file, instead of per tree, and was a way to glue the files together.
Ultimately, we nowadays don't live in that world anymore, and this
information is entirely useless anyway, as people are very welcome to
copy these files into any projects they like, and they shouldn't have to
change bits that are part of our copyright header for that.
hence, let's just get rid of this old cruft, and shorten our codebase a
bit.
Files which are installed as-is (any .service and other unit files, .conf
files, .policy files, etc), are left as is. My assumption is that SPDX
identifiers are not yet that well known, so it's better to retain the
extended header to avoid any doubt.
I also kept any copyright lines. We can probably remove them, but it'd nice to
obtain explicit acks from all involved authors before doing that.
This reverts the mmap parts of f5aeac1439,
but keeps the part which restricts address families which works
correctly.
Unfortunately the MIPS toolchains still do not implement PT_GNU_STACK.
This means that while the commit to restrict mmap on MIPS was "correct",
it had the side effect of causing pthread_create to fail because glibc tries
to allocate an executable stack for new threads in the absense of
PT_GNU_STACK. We should wait until PT_GNU_STACK is implemented in all
the relevant parts of the toolchain (at least gcc and glibc) before
enabling this again.
In commit da1921a5c3 ppc64/ppc64el were added as supported architectures for
socketcall() for the POWER family. Extend the support for the 32bits
architectures.
This reworks system call filter parsing, and replaces a couple of "bool"
function arguments by a single flags parameter.
This shouldn't change behaviour, except for one case: when we
recursively call our parsing function on our own syscall list, then
we'll lower the log level to LOG_DEBUG from LOG_WARNING, because at that
point things are just a problem in our own code rather than in the user
configuration we are parsing, and we shouldn't hence generate confusing
warnings about syntax errors.
Fixes: #8261
The VDSO provided by the kernel for x32, uses x86-64 syscalls instead of
x32 ones.
I think we can safely allow this; the set of x86-64 syscalls should be
very similar to the x32 ones. The real point is not to allow *x86*
syscalls, because some of those are inconveniently multiplexed and we're
apparently not able to block the specific actions we want to.
Booting with `systemd.log_level=debug` and looking in `dmesg -u` showed
messages like this:
systemd[433]: Failed to add rule for system call n/a() / 156, ignoring:
Numerical argument out of domain
This commit fixes it to:
systemd[449]: Failed to add rule for system call _sysctl() / 156,
ignoring: Numerical argument out of domain
Some of the messages could be even more misleading, e.g. we were reporting
that utimensat() / 320 was skipped as non-existent on x86, when actually
the syscall number 320 is kexec_file_load() on x86 .
The problem was that syscall NRs are looked up (and correctly passed to
libseccomp) as native syscall NRs. But we forgot that when we tried to
go back from the syscall NR to the name.
I think the natural way to write this would be
seccomp_syscall_resolve_num(nr), however there is no such function.
I couldn't work out a short comment that would make this clearer. FWIW
I wrote it up as a ticket for libseccomp instead.
https://github.com/seccomp/libseccomp/issues/104
If multiple SystemCallFilter= settings, some of them are whitelist
and the others are blacklist, are sent to bus, then the parse
result was corrupted.
This fixes the parse logic, now it is the same as one used in
load-fragment.c
Also remove the warning:
./src/shared/seccomp-util.c:1414:2: warning: #warning "Consider adding the right mmap() syscall definitions here!" [-Wcpp]
#warning "Consider adding the right mmap() syscall definitions here!"
This makes things a bit easier to read I think, and also makes sure we
always use the _unlikely_ wrapper around it, which so far we used
sometimes and other times we didn't. Let's clean that up.
When compiling with an old kernel on architectures for which the
number is not defined in missing.h, a warning is generated in missing.h.
Let's just skip the protection in this case, to allow build to proceed.
MemoryDenyWriteExecution policy could be be bypassed by using pkey_mprotect
instead of mprotect to create an executable writable mapping.
The impact is mitigated by the fact that the man page says "Note that this
feature is fully available on x86-64, and partially on x86", so hopefully
people do not rely on it as a sole security measure.
Found by Karin Hossen and Thomas Imbert from Sogeti ESEC R&D.
https://bugs.launchpad.net/bugs/1725348
This makes each system call in SystemCallFilter= blacklist optionally
takes errno name or number after a colon. The errno takes precedence
over the one given by SystemCallErrorNumber=.
C.f. #7173.
Closes#7169.
The gettid syscall is one of the most basic syscalls, it never fails and
it operates on current thread. Most applications are not suposed to use
it, however even if it is used there is no much justification on blocking
it. This patch removes it from '@process' set so if users blacklist this
set to block setns or clone syscalls, the gettid syscall will still be
available. Of course they can always block gettid explicitly.
Note that the gettid is already in the '@default' set.
System calls might exist on some archs but not on others, or might be
multiplexed but not on others. Ignore such errors when putting together
a filter at this location like we already do it on all others.
Unfortunately libseccomp doesn't return (nor document) clean error
codes, hence until then only check for specific error codes that we
propagate, but ignore (but debug log) all others. Do this at one more
place, we are already doing that at all others.
When a libseccomp implementation doesn't know a syscall yet, that's no
reason for us to fail completely. Instead, debug log, and proceed.
This hopefully fixes the preadv2/pwritev2 issues pointed out here:
https://github.com/systemd/systemd/pull/6952#issuecomment-334302923
Also, move prlimit64() out of @resources.
prlimit64() may be used both for getting and setting resource limits, and
is implicitly called by glibc at various places, on some archs, the same
was as getrlimit(). SImilar, igetrlimit() is an arch-specific
replacement for getrlimit(), and hence should be whitelisted at the same
place as getrlimit() and prlimit64().
Also see: https://lists.freedesktop.org/archives/systemd-devel/2017-September/039543.html
This removes the '@credentials' syscall set that was added in commit
v234-468-gcd0ddf6f75.
Most of these syscalls are so simple that we do not want to filter them.
They work on the current calling process, doing only read operations,
they do not have a deep kernel path.
The problem may only be in 'capget' syscall since it can query arbitrary
processes, and used to discover processes, however sending signal 0 to
arbitrary processes can be used to discover if a process exists or not.
It is unfortunate that Linux allows to query processes of different
users. Lets put it now in '@process' syscall set, and later we may add
it to a new '@basic-process' set that allows most basic process
operations.