Commit graph

132 commits

Author SHA1 Message Date
Christian Ehrhardt 5ef3ed97e3
seccomp: use per arch shmat_syscall
At the beginning of seccomp_memory_deny_write_execute architectures
can set individual filter_syscall, block_syscall, shmat_syscall values.
The former two are then used in the call to add_seccomp_syscall_filter
but shmat_syscall is not.

Right now all shmat_syscall values are the same, so the change is a
no-op, but if ever an architecture is added/modified this would be a
subtle source for a mistake so fix it by using shmat_syscall later.

Signed-off-by: Christian Ehrhardt <christian.ehrhardt@canonical.com>
2019-12-05 07:19:12 +01:00
Christian Ehrhardt 903659e7b2
seccomp: ensure rules are loaded in seccomp_memory_deny_write_execute
If seccomp_memory_deny_write_execute was fatally failing to load rules it
already returned a bad retval.
But if any adding filters failed it skipped the subsequent seccomp_load and
always returned an rc of 0 even if no rule was loaded at all.

Lets fix this requiring to (non fatally-failing) load at least one rule set.

Signed-off-by: Christian Ehrhardt <christian.ehrhardt@canonical.com>
2019-12-05 07:19:12 +01:00
Christian Ehrhardt bed4668d1d
seccomp: fix multiplexed system calls
Since libseccomp 2.4.2 more architectures have shmat handled as multiplexed
call. Those will fail to be added due to seccomp_rule_add_exact failing
on them since they'd need to add multiple rules [1].
See the discussion at https://github.com/seccomp/libseccomp/issues/193

After discussions about the options rejected [2][3] the initial thought of
a fallback to the non '_exact' version of the seccomp rule adding the next
option is to handle those now affected (i386, s390, s390x) the same way as
ppc which ignores and does not block shmat.

[1]: https://github.com/seccomp/libseccomp/issues/193
[2]: https://github.com/systemd/systemd/pull/14167#issuecomment-559136906
[3]: https://github.com/systemd/systemd/commit/469830d1
2019-12-05 07:19:07 +01:00
Kevin Kuehler 620dbdd248 shared: Add ProtectKernelLogs property
Add seccomp_protect_syslog, which adds a filter rule for the syslog
system call.
2019-11-11 12:11:56 -08:00
Zbigniew Jędrzejewski-Szmek 9493b16871 Add @pkey syscall group
Inspired by https://bugzilla.redhat.com/show_bug.cgi?id=1769299.
This change doesn't solve the issue, but makes it easier to whitelist the
syscall group.
2019-11-08 14:41:22 +01:00
Zbigniew Jędrzejewski-Szmek 6ca6771069 seccomp: add all *time64 syscalls
From https://bugzilla.redhat.com/show_bug.cgi?id=1770154:
> utime is an obsolete system call. The current kernel interface is
> utimensat_time64. New 32-bit architectures do not even provide the utime
> system call.

Also add all other *time64 syscalls listed in
https://fedora.juszkiewicz.com.pl/syscalls.html.
2019-11-08 14:40:49 +01:00
Lennart Poettering 9e48626571 seccomp: add new Linux 5.3 syscalls to syscall filter lists
Many syscalls added and all fit nicely into existing groups, hence lets
add them there.
2019-10-30 15:42:49 +01:00
Zbigniew Jędrzejewski-Szmek a8fb09f573 shared/seccomp: add sync_file_range2
Some architectures need the arguments to be reordered because of alignment
issues. Otherwise, it's the same as sync_file_range.
2019-08-19 11:10:40 +02:00
Dan Streetman 57311925aa src/shared/seccomp-util.c: Add mmap definitions for s390 2019-08-13 15:40:36 -04:00
Lennart Poettering 46fcf95dbe seccomp: add new 5.1 syscall pidfd_send_signal() to filter set list 2019-05-28 17:01:05 +02:00
Lennart Poettering 915fb32438 seccomp: add scmp_act_kill_process() helper that returns SCMP_ACT_KILL_PROCESS if supported 2019-05-24 10:48:28 +02:00
Anita Zhang 7bc5e0b12b seccomp: check more error codes from seccomp_load()
We noticed in our tests that occasionally SystemCallFilter= would
fail to set and the service would run with no syscall filtering.
Most of the time the same tests would apply the filter and fail
the service as expected. While it's not totally clear why this happens,
we noticed seccomp_load() in the systemd code base would fail open for
all errors except EPERM and EACCES.

ENOMEM, EINVAL, and EFAULT seem like reasonable values to add to the
error set based on what I gather from libseccomp code and man pages:

-ENOMEM: out of memory, failed to allocate space for a libseccomp structure, or would exceed a defined constant
-EINVAL: kernel isn't configured to support the operations, args are invalid (to seccomp_load(), seccomp(), or prctl())
-EFAULT: addresses passed as args are invalid
2019-04-12 10:23:07 +02:00
Zbigniew Jędrzejewski-Szmek b3e8032bb4
Merge pull request #12198 from keszybz/seccomp-parsing-logging
Seccomp parsing logging cleanup
2019-04-03 17:19:14 +02:00
Zbigniew Jędrzejewski-Szmek da4dc9a674 seccomp: rework how the S[UG]ID filter is installed
If we know that a syscall is undefined on the given architecture, don't
even try to add it.

Try to install the filter even if some syscalls fail. Also use a helper
function to make the whole a bit less magic.

This allows the S[UG]ID test to pass on arm64.
2019-04-03 13:33:06 +02:00
Zbigniew Jędrzejewski-Szmek 58f6ab4454 pid1: pass unit name to seccomp parser when we have no file location
Building on previous commit, let's pass the unit name when parsing
dbus message or builtin whitelist, which is better than nothing.

seccomp_parse_syscall_filter() is not needed anymore, so it is removed,
and seccomp_parse_syscall_filter_full() is renamed to take its place.
2019-04-03 09:17:42 +02:00
Lennart Poettering 3c27973b13 seccomp: introduce seccomp_restrict_suid_sgid() for blocking chmod() for suid/sgid files 2019-04-02 16:56:48 +02:00
Lennart Poettering 9e6e543c17 seccomp: add debug messages to seccomp_protect_hostname() 2019-04-02 16:56:48 +02:00
Lennart Poettering 6fee3be0b4 seccomp: add rseq() to default list of syscalls to whitelist
Apparently glibc is going to call this implicitly soon, hence let's
whitelist this by default.

Fixes: #12127
2019-03-28 12:09:38 +01:00
Zbigniew Jędrzejewski-Szmek 67fb5f338f seccomp: allow shmat to be a separate syscall on architectures which use a multiplexer
After
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0d6040d46817,
those syscalls have their separate numbers and we can block them.
But glibc might still use the old ones. So let's just do a best-effort
block and not assume anything about how effective it is.
2019-03-15 15:46:41 +01:00
Zbigniew Jędrzejewski-Szmek e55bdf9b6c seccomp: shm{get,at,dt} now have their own numbers everywhere
E.g. on i686:

(previously)
arch x86: SCMP_SYS(mmap) = 90
arch x86: SCMP_SYS(mmap2) = 192
arch x86: SCMP_SYS(shmat) = -221
arch x86: SCMP_SYS(shmat) = -221
arch x86: SCMP_SYS(shmdt) = -222

(now)
arch x86: SCMP_SYS(mmap) = 90
arch x86: SCMP_SYS(mmap2) = 192
arch x86: SCMP_SYS(shmat) = 397
arch x86: SCMP_SYS(shmat) = 397
arch x86: SCMP_SYS(shmdt) = 398

The relevant commit seems to be
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0d6040d46817.
2019-03-15 15:28:43 +01:00
Lennart Poettering d8b4d14df4 util: split out nulstr related stuff to nulstr-util.[ch] 2019-03-14 13:25:52 +01:00
Topi Miettinen aecd5ac621 core: ProtectHostname= feature
Let services use a private UTS namespace. In addition, a seccomp filter is
installed on set{host,domain}name and a ro bind mounts on
/proc/sys/kernel/{host,domain}name.
2019-02-20 10:50:44 +02:00
Lennart Poettering 57c03b1e6e seccomp: drop mincore() from @system-service syscall filter group
Previously, this system call was included in @system-service since it is
a "getter" only, i.e. only queries information, and doesn't change
anything, and hence was considered not risky.

However, as it turns out, mincore() is actually security sensitive, see
the discussion here:

https://lwn.net/Articles/776034/

Hence, let's adjust the system call filter and drop mincore() from it.

This constitues a compatibility break to some level, however I presume
we can get away with this as the systemcall is pretty exotic. The fact
that it is pretty exotic is also reflected by the fact that the kernel
intends to majorly change behaviour of the system call soon (see the
linked LWN article)
2019-01-16 18:08:35 +01:00
Lennart Poettering ad5ffe3716 seccomp-util: drop process_vm_readv from @debug group
it's already part of @ipc, no need to have it in both. Given that @ipc
is much more popular (as it is part of @system-service for example),
let's not define it a second time.
2018-11-30 16:46:09 +01:00
Zbigniew Jędrzejewski-Szmek baaa35ad70 coccinelle: make use of SYNTHETIC_ERRNO
Ideally, coccinelle would strip unnecessary braces too. But I do not see any
option in coccinelle for this, so instead, I edited the patch text using
search&replace to remove the braces. Unfortunately this is not fully automatic,
in particular it didn't deal well with if-else-if-else blocks and ifdefs, so
there is an increased likelikehood be some bugs in such spots.

I also removed part of the patch that coccinelle generated for udev, where we
returns -1 for failure. This should be fixed independently.
2018-11-22 10:54:38 +01:00
Lennart Poettering a05cfe230f seccomp: add some missing syscalls to filter sets 2018-11-16 16:10:57 +01:00
Zbigniew Jędrzejewski-Szmek a90db619ca shared: fix typo 2018-11-10 07:43:57 +01:00
Yu Watanabe 14cb109d45 tree-wide: replace 'unsigned int' with 'unsigned' 2018-10-19 22:19:12 +02:00
Zbigniew Jędrzejewski-Szmek 7e86bd73a4 seccomp: tighten checking of seccomp filter creation
In seccomp code, the code is changed to propagate errors which are about
anything other than unknown/unimplemented syscalls. I *think* such errors
should not happen in normal usage, but so far we would summarilly ignore all
errors, so that part is uncertain. If it turns out that other errors occur and
should be ignored, this should be added later.

In nspawn, we would count the number of added filters, but didn't use this for
anything. Drop that part.

The comments suggested that seccomp_add_syscall_filter_item() returned negative
if the syscall is unknown, but this wasn't true: it returns 0.

The error at this point can only be if the syscall was known but couldn't be
added. If the error comes from our internal whitelist in nspawn, treat this as
error, because it means that our internal table is wrong. If the error comes
from user arguments, warn and ignore. (If some syscall is not known at current
architecture, it is still silently ignored.)
2018-09-24 17:21:09 +02:00
Zbigniew Jędrzejewski-Szmek b54f36c604 seccomp: reduce logging about failure to add syscall to seccomp
Our logs are full of:
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call oldstat() / -10037, ignoring: Numerical argument out of domain
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call get_thread_area() / -10076, ignoring: Numerical argument out of domain
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call set_thread_area() / -10079, ignoring: Numerical argument out of domain
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call oldfstat() / -10034, ignoring: Numerical argument out of domain
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call oldolduname() / -10036, ignoring: Numerical argument out of domain
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call oldlstat() / -10035, ignoring: Numerical argument out of domain
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call waitpid() / -10073, ignoring: Numerical argument out of domain
...
This is pointless and makes debug logs hard to read. Let's keep the logs
in test code, but disable it in nspawn and pid1. This is done through a function
parameter because those functions operate recursively and it's not possible to
make the caller to log meaningfully.


There should be no functional change, except the skipped debug logs.
2018-09-24 17:21:09 +02:00
Lucas Werkmeister 9d7fe7c65a seccomp: permit specifying multiple errnos for a syscall
If more than one errno is specified for a syscall in SystemCallFilter=,
use the last one instead of reporting an error. This is especially
useful when used with system call sets:

    SystemCallFilter=@privileged:EPERM @reboot

This will block any system call requiring super-user capabilities with
EPERM, except for attempts to reboot the system, which will immediately
terminate the process. (@reboot is included in @privileged.)

This also effectively fixes #9939, since specifying different errnos for
“the same syscall” (same pseudo syscall number) is no longer an error.
2018-09-07 21:44:13 +02:00
Lucas Werkmeister 851ee70a3d seccomp: improve error reporting
Only report OOM if that was actually the error of the operation,
explicitly report the possible error that a syscall was already blocked
with a different errno and translate that into a more sensible errno
(EEXIST only makes sense in connection to the hashmap), and pass through
all other potential errors unmodified. Part of #9939.
2018-08-29 21:42:03 +02:00
Lion Yang a9518dc369 seccomp: add swapcontext into @process for ppc32
There are some modern programming languages use userspace context switches
to implement coroutine features. PowerPC (32-bit) needs syscall "swapcontext" to get
contexts or switch between contexts, which is special.

Adding this rule should fix #9485.
2018-07-03 13:35:02 +02:00
Lennart Poettering e05ee49b14 seccomp: explain why we use setuid rather than @setuid in @privileged 2018-06-14 17:44:20 +02:00
Lennart Poettering 705268414f seccomp: add new system call filter, suitable as default whitelist for system services
Currently we employ mostly system call blacklisting for our system
services. Let's add a new system call filter group @system-service that
helps turning this around into a whitelist by default.

The new group is very similar to nspawn's default filter list, but in
some ways more restricted (as sethostname() and suchlike shouldn't be
available to most system services just like that) and in others more
relaxed (for example @keyring is blocked in nspawn since it's not
properly virtualized yet in the kernel, but is fine for regular system
services).
2018-06-14 17:44:20 +02:00
Lennart Poettering 0c69794138 tree-wide: remove Lennart's copyright lines
These lines are generally out-of-date, incomplete and unnecessary. With
SPDX and git repository much more accurate and fine grained information
about licensing and authorship is available, hence let's drop the
per-file copyright notice. Of course, removing copyright lines of others
is problematic, hence this commit only removes my own lines and leaves
all others untouched. It might be nicer if sooner or later those could
go away too, making git the only and accurate source of authorship
information.
2018-06-14 10:20:20 +02:00
Lennart Poettering 818bf54632 tree-wide: drop 'This file is part of systemd' blurb
This part of the copyright blurb stems from the GPL use recommendations:

https://www.gnu.org/licenses/gpl-howto.en.html

The concept appears to originate in times where version control was per
file, instead of per tree, and was a way to glue the files together.
Ultimately, we nowadays don't live in that world anymore, and this
information is entirely useless anyway, as people are very welcome to
copy these files into any projects they like, and they shouldn't have to
change bits that are part of our copyright header for that.

hence, let's just get rid of this old cruft, and shorten our codebase a
bit.
2018-06-14 10:20:20 +02:00
Yu Watanabe 86c2a9f1c2 nsflsgs: drop namespace_flag_{from,to}_string()
This also drops namespace_flag_to_string_many_with_check(), and
renames namespace_flag_{from,to}_string_many() to
namespace_flags_{from,to}_string().
2018-05-05 11:07:37 +09:00
Zbigniew Jędrzejewski-Szmek 11a1589223 tree-wide: drop license boilerplate
Files which are installed as-is (any .service and other unit files, .conf
files, .policy files, etc), are left as is. My assumption is that SPDX
identifiers are not yet that well known, so it's better to retain the
extended header to avoid any doubt.

I also kept any copyright lines. We can probably remove them, but it'd nice to
obtain explicit acks from all involved authors before doing that.
2018-04-06 18:58:55 +02:00
Yu Watanabe 1cc6c93a95 tree-wide: use TAKE_PTR() and TAKE_FD() macros 2018-04-05 14:26:26 +09:00
James Cowgill 303d6b4ca6 Partially revert "seccomp: add mmap and address family restrictions for MIPS" (#8563)
This reverts the mmap parts of f5aeac1439,
but keeps the part which restricts address families which works
correctly.

Unfortunately the MIPS toolchains still do not implement PT_GNU_STACK.
This means that while the commit to restrict mmap on MIPS was "correct",
it had the side effect of causing pthread_create to fail because glibc tries
to allocate an executable stack for new threads in the absense of
PT_GNU_STACK. We should wait until PT_GNU_STACK is implemented in all
the relevant parts of the toolchain (at least gcc and glibc) before
enabling this again.
2018-03-23 16:04:16 +01:00
James Cowgill f5aeac1439 seccomp: add mmap and address family restrictions for MIPS (#8547) 2018-03-22 15:40:44 +01:00
Mathieu Malaterre 0d9fca76bb seccomp: enable RestrictAddressFamilies on ppc (#8505)
In commit da1921a5c3 ppc64/ppc64el were added as supported architectures for
socketcall() for the POWER family. Extend the support for the 32bits
architectures.
2018-03-20 16:08:20 +01:00
Lennart Poettering 13d92c6300 seccomp: rework functions for parsing system call filters
This reworks system call filter parsing, and replaces a couple of "bool"
function arguments by a single flags parameter.

This shouldn't change behaviour, except for one case: when we
recursively call our parsing function on our own syscall list, then
we'll lower the log level to LOG_DEBUG from LOG_WARNING, because at that
point things are just a problem in our own code rather than in the user
configuration we are parsing, and we shouldn't hence generate confusing
warnings about syntax errors.

Fixes: #8261
2018-02-27 19:59:09 +01:00
Alan Jenkins 2428aaf8a2 seccomp: allow x86-64 syscalls on x32, used by the VDSO (fix #8060)
The VDSO provided by the kernel for x32, uses x86-64 syscalls instead of
x32 ones.

I think we can safely allow this; the set of x86-64 syscalls should be
very similar to the x32 ones.  The real point is not to allow *x86*
syscalls, because some of those are inconveniently multiplexed and we're
apparently not able to block the specific actions we want to.
2018-02-02 18:12:34 +00:00
Alan Jenkins 5c19ff79de seccomp-util: fix alarming debug message (#8002, #8001)
Booting with `systemd.log_level=debug` and looking in `dmesg -u` showed
messages like this:

    systemd[433]: Failed to add rule for system call n/a() / 156, ignoring:
    Numerical argument out of domain

This commit fixes it to:

    systemd[449]: Failed to add rule for system call _sysctl() / 156,
    ignoring: Numerical argument out of domain

Some of the messages could be even more misleading, e.g. we were reporting
that utimensat() / 320 was skipped as non-existent on x86, when actually
the syscall number 320 is kexec_file_load() on x86 .

The problem was that syscall NRs are looked up (and correctly passed to
libseccomp) as native syscall NRs.  But we forgot that when we tried to
go back from the syscall NR to the name.

I think the natural way to write this would be
seccomp_syscall_resolve_num(nr), however there is no such function.
I couldn't work out a short comment that would make this clearer.  FWIW
I wrote it up as a ticket for libseccomp instead.
https://github.com/seccomp/libseccomp/issues/104
2018-01-31 17:20:14 +00:00
Lennart Poettering 7785da68e6
Merge pull request #7695 from yuwata/transient-socket
DBus-API: implement transient socket unit
2017-12-23 19:20:29 +01:00
Yu Watanabe 898748d8b9 core,seccomp: fix logic to parse syscall filter in dbus-execute.c
If multiple SystemCallFilter= settings, some of them are whitelist
and the others are blacklist, are sent to bus, then the parse
result was corrupted.
This fixes the parse logic, now it is the same as one used in
load-fragment.c
2017-12-23 18:45:32 +09:00
Mathieu Malaterre 63d00dfb64 shared/seccomp: add mmap handling for powerpc
Also remove the warning:

./src/shared/seccomp-util.c:1414:2: warning: #warning "Consider adding the right mmap() syscall definitions here!" [-Wcpp]
 #warning "Consider adding the right mmap() syscall definitions here!"
2017-12-22 15:30:03 +01:00
Lennart Poettering f1d34068ef tree-wide: add DEBUG_LOGGING macro that checks whether debug logging is on (#7645)
This makes things a bit easier to read I think, and also makes sure we
always use the _unlikely_ wrapper around it, which so far we used
sometimes and other times we didn't. Let's clean that up.
2017-12-15 11:09:00 +01:00