Commit graph

98 commits

Author SHA1 Message Date
Lennart Poettering 705268414f seccomp: add new system call filter, suitable as default whitelist for system services
Currently we employ mostly system call blacklisting for our system
services. Let's add a new system call filter group @system-service that
helps turning this around into a whitelist by default.

The new group is very similar to nspawn's default filter list, but in
some ways more restricted (as sethostname() and suchlike shouldn't be
available to most system services just like that) and in others more
relaxed (for example @keyring is blocked in nspawn since it's not
properly virtualized yet in the kernel, but is fine for regular system
services).
2018-06-14 17:44:20 +02:00
Lennart Poettering 0c69794138 tree-wide: remove Lennart's copyright lines
These lines are generally out-of-date, incomplete and unnecessary. With
SPDX and git repository much more accurate and fine grained information
about licensing and authorship is available, hence let's drop the
per-file copyright notice. Of course, removing copyright lines of others
is problematic, hence this commit only removes my own lines and leaves
all others untouched. It might be nicer if sooner or later those could
go away too, making git the only and accurate source of authorship
information.
2018-06-14 10:20:20 +02:00
Lennart Poettering 818bf54632 tree-wide: drop 'This file is part of systemd' blurb
This part of the copyright blurb stems from the GPL use recommendations:

https://www.gnu.org/licenses/gpl-howto.en.html

The concept appears to originate in times where version control was per
file, instead of per tree, and was a way to glue the files together.
Ultimately, we nowadays don't live in that world anymore, and this
information is entirely useless anyway, as people are very welcome to
copy these files into any projects they like, and they shouldn't have to
change bits that are part of our copyright header for that.

hence, let's just get rid of this old cruft, and shorten our codebase a
bit.
2018-06-14 10:20:20 +02:00
Yu Watanabe 86c2a9f1c2 nsflsgs: drop namespace_flag_{from,to}_string()
This also drops namespace_flag_to_string_many_with_check(), and
renames namespace_flag_{from,to}_string_many() to
namespace_flags_{from,to}_string().
2018-05-05 11:07:37 +09:00
Zbigniew Jędrzejewski-Szmek 11a1589223 tree-wide: drop license boilerplate
Files which are installed as-is (any .service and other unit files, .conf
files, .policy files, etc), are left as is. My assumption is that SPDX
identifiers are not yet that well known, so it's better to retain the
extended header to avoid any doubt.

I also kept any copyright lines. We can probably remove them, but it'd nice to
obtain explicit acks from all involved authors before doing that.
2018-04-06 18:58:55 +02:00
Yu Watanabe 1cc6c93a95 tree-wide: use TAKE_PTR() and TAKE_FD() macros 2018-04-05 14:26:26 +09:00
James Cowgill 303d6b4ca6 Partially revert "seccomp: add mmap and address family restrictions for MIPS" (#8563)
This reverts the mmap parts of f5aeac1439,
but keeps the part which restricts address families which works
correctly.

Unfortunately the MIPS toolchains still do not implement PT_GNU_STACK.
This means that while the commit to restrict mmap on MIPS was "correct",
it had the side effect of causing pthread_create to fail because glibc tries
to allocate an executable stack for new threads in the absense of
PT_GNU_STACK. We should wait until PT_GNU_STACK is implemented in all
the relevant parts of the toolchain (at least gcc and glibc) before
enabling this again.
2018-03-23 16:04:16 +01:00
James Cowgill f5aeac1439 seccomp: add mmap and address family restrictions for MIPS (#8547) 2018-03-22 15:40:44 +01:00
Mathieu Malaterre 0d9fca76bb seccomp: enable RestrictAddressFamilies on ppc (#8505)
In commit da1921a5c3 ppc64/ppc64el were added as supported architectures for
socketcall() for the POWER family. Extend the support for the 32bits
architectures.
2018-03-20 16:08:20 +01:00
Lennart Poettering 13d92c6300 seccomp: rework functions for parsing system call filters
This reworks system call filter parsing, and replaces a couple of "bool"
function arguments by a single flags parameter.

This shouldn't change behaviour, except for one case: when we
recursively call our parsing function on our own syscall list, then
we'll lower the log level to LOG_DEBUG from LOG_WARNING, because at that
point things are just a problem in our own code rather than in the user
configuration we are parsing, and we shouldn't hence generate confusing
warnings about syntax errors.

Fixes: #8261
2018-02-27 19:59:09 +01:00
Alan Jenkins 2428aaf8a2 seccomp: allow x86-64 syscalls on x32, used by the VDSO (fix #8060)
The VDSO provided by the kernel for x32, uses x86-64 syscalls instead of
x32 ones.

I think we can safely allow this; the set of x86-64 syscalls should be
very similar to the x32 ones.  The real point is not to allow *x86*
syscalls, because some of those are inconveniently multiplexed and we're
apparently not able to block the specific actions we want to.
2018-02-02 18:12:34 +00:00
Alan Jenkins 5c19ff79de seccomp-util: fix alarming debug message (#8002, #8001)
Booting with `systemd.log_level=debug` and looking in `dmesg -u` showed
messages like this:

    systemd[433]: Failed to add rule for system call n/a() / 156, ignoring:
    Numerical argument out of domain

This commit fixes it to:

    systemd[449]: Failed to add rule for system call _sysctl() / 156,
    ignoring: Numerical argument out of domain

Some of the messages could be even more misleading, e.g. we were reporting
that utimensat() / 320 was skipped as non-existent on x86, when actually
the syscall number 320 is kexec_file_load() on x86 .

The problem was that syscall NRs are looked up (and correctly passed to
libseccomp) as native syscall NRs.  But we forgot that when we tried to
go back from the syscall NR to the name.

I think the natural way to write this would be
seccomp_syscall_resolve_num(nr), however there is no such function.
I couldn't work out a short comment that would make this clearer.  FWIW
I wrote it up as a ticket for libseccomp instead.
https://github.com/seccomp/libseccomp/issues/104
2018-01-31 17:20:14 +00:00
Lennart Poettering 7785da68e6
Merge pull request #7695 from yuwata/transient-socket
DBus-API: implement transient socket unit
2017-12-23 19:20:29 +01:00
Yu Watanabe 898748d8b9 core,seccomp: fix logic to parse syscall filter in dbus-execute.c
If multiple SystemCallFilter= settings, some of them are whitelist
and the others are blacklist, are sent to bus, then the parse
result was corrupted.
This fixes the parse logic, now it is the same as one used in
load-fragment.c
2017-12-23 18:45:32 +09:00
Mathieu Malaterre 63d00dfb64 shared/seccomp: add mmap handling for powerpc
Also remove the warning:

./src/shared/seccomp-util.c:1414:2: warning: #warning "Consider adding the right mmap() syscall definitions here!" [-Wcpp]
 #warning "Consider adding the right mmap() syscall definitions here!"
2017-12-22 15:30:03 +01:00
Lennart Poettering f1d34068ef tree-wide: add DEBUG_LOGGING macro that checks whether debug logging is on (#7645)
This makes things a bit easier to read I think, and also makes sure we
always use the _unlikely_ wrapper around it, which so far we used
sometimes and other times we didn't. Let's clean that up.
2017-12-15 11:09:00 +01:00
Zbigniew Jędrzejewski-Szmek 53e1b68390 Add SPDX license identifiers to source files under the LGPL
This follows what the kernel is doing, c.f.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5fd54ace4721fc5ce2bb5aef6318fcf17f421460.
2017-11-19 19:08:15 +01:00
Zbigniew Jędrzejewski-Szmek 91691f1d3e shared/seccomp: skip pkey_mprotect protections if the syscall is unknown
When compiling with an old kernel on architectures for which the
number is not defined in missing.h, a warning is generated in missing.h.
Let's just skip the protection in this case, to allow build to proceed.
2017-11-13 09:35:49 +01:00
Zbigniew Jędrzejewski-Szmek b835eeb4ec
shared/seccomp: disallow pkey_mprotect the same as mprotect for W^X mappings (#7295)
MemoryDenyWriteExecution policy could be be bypassed by using pkey_mprotect
instead of mprotect to create an executable writable mapping.

The impact is mitigated by the fact that the man page says "Note that this
feature is fully available on x86-64, and partially on x86", so hopefully
people do not rely on it as a sole security measure.

Found by Karin Hossen and Thomas Imbert from Sogeti ESEC R&D.

https://bugs.launchpad.net/bugs/1725348
2017-11-12 17:28:48 +01:00
Lennart Poettering ce5faeac1f seccomp: include ARM set_tls in @default (#7297)
Fixes: #7135
2017-11-12 16:34:43 +01:00
Yu Watanabe 8cfa775f4f core: add support to specify errno in SystemCallFilter=
This makes each system call in SystemCallFilter= blacklist optionally
takes errno name or number after a colon. The errno takes precedence
over the one given by SystemCallErrorNumber=.

C.f. #7173.
Closes #7169.
2017-11-11 21:54:12 +09:00
Antonio Rojas 8e6a7a8b2b Fix typo in statx macro (#7180)
This makes statx properly whitelisted in supported systems.
2017-11-10 11:07:36 +01:00
Lennart Poettering af0f047ba8 seccomp: port @privileged to use @reboot + @swap
Let's reuse two groups we already defined to make @privileged a bit
shorter.
2017-10-05 15:42:48 +02:00
Lennart Poettering e59608fa5f seccomp: there is no "kexec" syscall
it's called "kexec_load".
2017-10-05 15:42:48 +02:00
Lennart Poettering 44898c5358 seccomp: add three more seccomp groups
@aio → asynchronous IO calls
@sync → msync/fsync/... and friends
@chown → changing file ownership

(Also, change @privileged to reference @chown now, instead of the
individual syscalls it contains)
2017-10-05 15:42:48 +02:00
Djalal Harouni 7c72bab4e3 seccomp: remove 'gettid' syscall from '@process' syscall set (#6989)
The gettid syscall is one of the most basic syscalls, it never fails and
it operates on current thread. Most applications are not suposed to use
it, however even if it is used there is no much justification on blocking
it. This patch removes it from '@process' set so if users blacklist this
set to block setns or clone syscalls, the gettid syscall will still be
available. Of course they can always block gettid explicitly.

Note that the gettid is already in the '@default' set.
2017-10-05 14:46:41 +02:00
Lennart Poettering 448ac526a3 seccomp: ignore (and debug log) errors by all invocations of seccomp_rule_add_exact()
System calls might exist on some archs but not on others, or might be
multiplexed but not on others. Ignore such errors when putting together
a filter at this location like we already do it on all others.
2017-10-05 11:27:34 +02:00
Lennart Poettering 1c6af69b2d seccomp: always handle seccomp_load() failing the same way
Unfortunately libseccomp doesn't return (nor document) clean error
codes, hence until then only check for specific error codes that we
propagate, but ignore (but debug log) all others. Do this at one more
place, we are already doing that at all others.
2017-10-05 11:27:34 +02:00
Lennart Poettering ff217dc3af seccomp: react gracefully if we can't translate a syscall name
When a libseccomp implementation doesn't know a syscall yet, that's no
reason for us to fail completely. Instead, debug log, and proceed.

This hopefully fixes the preadv2/pwritev2 issues pointed out here:

https://github.com/systemd/systemd/pull/6952#issuecomment-334302923
2017-10-05 11:27:34 +02:00
Lennart Poettering 4c3a917617 seccomp: include prlimit64 and ugetrlimit in @default
Also, move prlimit64() out of @resources.

prlimit64() may be used both for getting and setting resource limits, and
is implicitly called by glibc at various places, on some archs, the same
was as getrlimit(). SImilar, igetrlimit() is an arch-specific
replacement for getrlimit(), and hence should be whitelisted at the same
place as getrlimit() and prlimit64().

Also see: https://lists.freedesktop.org/archives/systemd-devel/2017-September/039543.html
2017-10-05 11:27:34 +02:00
Djalal Harouni 8f44de08e9 seccomp: add sched_yield syscall to the @default syscall set 2017-10-04 10:41:42 +01:00
Djalal Harouni 09d3020b0a seccomp: remove '@credentials' syscall set (#6958)
This removes the '@credentials' syscall set that was added in commit
v234-468-gcd0ddf6f75.

Most of these syscalls are so simple that we do not want to filter them.
They work on the current calling process, doing only read operations,
they do not have a deep kernel path.

The problem may only be in 'capget' syscall since it can query arbitrary
processes, and used to discover processes, however sending signal 0 to
arbitrary processes can be used to discover if a process exists or not.
It is unfortunate that Linux allows to query processes of different
users. Lets put it now in '@process' syscall set, and later we may add
it to a new '@basic-process' set that allows most basic process
operations.
2017-10-03 07:20:05 +02:00
Lennart Poettering cff7bff880 seccomp: improve debug logging
Let's log explicitly at debug level if we encounter a syscall or group
that doesn#t exist at all.
2017-09-14 15:45:21 +02:00
Lennart Poettering cd0ddf6f75 seccomp: add four new syscall groups
These groups should be useful shortcuts for sets of closely related
syscalls where it usually makes more sense to allow them altogether or
not at all.
2017-09-14 15:45:21 +02:00
Lennart Poettering 0963c053fa seccomp: augment the @resources group a bit
Given that sched_setattr/sched_setparam/sched_setscheduler are already
in the group the closely related nice + ioprio_set should also be
included.

Also, order things alphabetically.
2017-09-14 15:45:21 +02:00
Lennart Poettering b887d2ebfe seccomp: beef up @process group a bit
Include the waid syscalls. If we permit forking then we should also
permit waiting for a process.

Similar to that: also permit determining the usage counters for
processes.

Include calls to determine process/thread identity. They have little
impact security-wise, but are very likely used when process management
of any form is done.

Also, add rt_sigqueueinfo + rt_tgsigqueueinfo as they are similar to
kill() and friends, but permit passing along a userdata ptr.
2017-09-14 15:45:21 +02:00
Lennart Poettering 7e0c3b8fda seccomp: "idle" is another obsolete syscall 2017-09-14 15:45:21 +02:00
Lennart Poettering 215728ff39 seccomp: order the syscalls in more groups alphabetically
No changes besides reordering.
2017-09-14 15:45:21 +02:00
Lennart Poettering ceaa6aa76b seccomp: let's update @file-system a bit
Let's add fremovexattr which was the only xattr syscall so far missing
from the group, even though lremovexattr and friends where included.

Add inotify_init, which is an older (but still supported) version of
inotify_init1.

Add oldfstat, oldlstat, oldstat which are old versions of the stat
syscalls on some archs.

Add utime, which is an older more limited version of utimes and
utimensat.

Enclose the "statx" entry in some ifdeffery to ensure libseccomp
actually knows the syscall. If libseccomp doesn't know it, then we'd get
EINVAL rather than EDOM (which is what is returned if a syscall is known
but not available on the local system) when resolving the syscall name
and we really don't want that, as we use the EDOM vs. EINVAL check for
determining whether a syscall makes sense at all.

Also, order things alphabetically.
2017-09-14 15:45:21 +02:00
Lennart Poettering 648a0ed0d7 seccomp: let's update base-io a bit
Let's add _llseek which is the syscall name on some archs that on others
is simply lseek (due to 64bit vs 32bit off_t confusion). Also, let's
sort things alphabetically.
2017-09-14 15:45:21 +02:00
Lennart Poettering e41b0f42a8 seccomp: update "@default" seccomp group a bit
Let's add more of the most basic operations to "@default" as absolute
baseline needed by glibc and such to operate. Specifically:

futex, get_robust_list, get_thread_area, membarrier, set_robust_list,
set_thread_area, set_tid_address are all required to properly implement
mutexes and other thread synchronization logic. Given that a ton of
datastructures are protected by mutexes (such as stdio and such), let's
just whitelist this by default, so that things can just work.

restart_syscall is used to implement EAGAIN SA_RESTART stuff in some
archs, and synthesized by the kernel without any explicit user logic,
hence let's make this work out of the box.
2017-09-14 15:45:21 +02:00
Lennart Poettering 960e4569e1 nspawn: implement configurable syscall whitelisting/blacklisting
Now that we have ported nspawn's seccomp code to the generic code in
seccomp-util, let's extend it to support whitelisting and blacklisting
of specific additional syscalls.

This uses similar syntax as PID1's support for system call filtering,
but in contrast to that always implements a blacklist (and not a
whitelist), as we prepopulate the filter with a blacklist, and the
unit's system call filter logic does not come with anything
prepopulated.

(Later on we might actually want to invert the logic here, and
whitelist rather than blacklist things, but at this point let's not do
that. In case we switch this over later, the syscall add/remove logic of
this commit should be compatible conceptually.)

Fixes: #5163

Replaces: #5944
2017-09-12 14:06:21 +02:00
Lennart Poettering 69b1b241bb seccomp: split out inner loop code of seccomp_add_syscall_filter_set()
Let's add a new helper function seccomp_add_syscall_filter_item() that
contains the inner loop code of seccomp_add_syscall_filter_set(). This
helper function we can then export and make use of elsewhere.
2017-09-11 18:00:07 +02:00
Lennart Poettering 12dc378902 seccomp: drop default_action parameter from seccomp_add_syscall_filter_set()
The function doesn't actually use the parameter, hence let's drop it.
2017-09-11 18:00:07 +02:00
Lennart Poettering a4135a742d shared: add statx(2) to @file-system syscall filter list (#6738) 2017-09-04 15:35:35 +02:00
Lennart Poettering 72eafe7159 seccomp: rework seccomp_lock_personality() to apply filter to all archs 2017-08-29 15:58:13 +02:00
Topi Miettinen 78e864e5b3 seccomp: LockPersonality boolean (#6193)
Add LockPersonality boolean to allow locking down personality(2)
system call so that the execution domain can't be changed.
This may be useful to improve security because odd emulations
may be poorly tested and source of vulnerabilities, while
system services shouldn't need any weird personalities.
2017-08-29 15:54:50 +02:00
Lennart Poettering 165a31c0db core: add two new special ExecStart= character prefixes
This patch adds two new special character prefixes to ExecStart= and
friends, in addition to the existing "-", "@" and "+":

"!"  → much like "+", except with a much reduced effect as it only
       disables the actual setresuid()/setresgid()/setgroups() calls, but
       leaves all other security features on, including namespace
       options. This is very useful in combination with
       RuntimeDirectory= or DynamicUser= and similar option, as a user
       is still allocated and used for the runtime directory, but the
       actual UID/GID dropping is left to the daemon process itself.
       This should make RuntimeDirectory= a lot more useful for daemons
       which insist on doing their own privilege dropping.

"!!" → Similar to "!", but on systems supporting ambient caps this
       becomes a NOP. This makes it relatively straightforward to write
       unit files that make use of ambient capabilities to let systemd
       drop all privs while retaining compatibility with systems that
       lack ambient caps, where priv dropping is the left to the daemon
       codes themselves.

This is an alternative approach to #6564 and related PRs.
2017-08-10 15:04:32 +02:00
Lennart Poettering 6eaaeee93a seccomp: add new @setuid seccomp group
This new group lists all UID/GID credential changing syscalls (which are
quite a number these days). This will become particularly useful in a
later commit, which uses this group to optionally permit user credential
changing to daemons in case ambient capabilities are not available.
2017-08-10 15:02:50 +02:00
Yu Watanabe b16bd5350f seccomp-util: add parse_syscall_archs() 2017-08-07 23:41:52 +09:00