Commit graph

69 commits

Author SHA1 Message Date
Lennart Poettering 4c3a917617 seccomp: include prlimit64 and ugetrlimit in @default
Also, move prlimit64() out of @resources.

prlimit64() may be used both for getting and setting resource limits, and
is implicitly called by glibc at various places, on some archs, the same
was as getrlimit(). SImilar, igetrlimit() is an arch-specific
replacement for getrlimit(), and hence should be whitelisted at the same
place as getrlimit() and prlimit64().

Also see: https://lists.freedesktop.org/archives/systemd-devel/2017-September/039543.html
2017-10-05 11:27:34 +02:00
Djalal Harouni 8f44de08e9 seccomp: add sched_yield syscall to the @default syscall set 2017-10-04 10:41:42 +01:00
Djalal Harouni 09d3020b0a seccomp: remove '@credentials' syscall set (#6958)
This removes the '@credentials' syscall set that was added in commit
v234-468-gcd0ddf6f75.

Most of these syscalls are so simple that we do not want to filter them.
They work on the current calling process, doing only read operations,
they do not have a deep kernel path.

The problem may only be in 'capget' syscall since it can query arbitrary
processes, and used to discover processes, however sending signal 0 to
arbitrary processes can be used to discover if a process exists or not.
It is unfortunate that Linux allows to query processes of different
users. Lets put it now in '@process' syscall set, and later we may add
it to a new '@basic-process' set that allows most basic process
operations.
2017-10-03 07:20:05 +02:00
Lennart Poettering cff7bff880 seccomp: improve debug logging
Let's log explicitly at debug level if we encounter a syscall or group
that doesn#t exist at all.
2017-09-14 15:45:21 +02:00
Lennart Poettering cd0ddf6f75 seccomp: add four new syscall groups
These groups should be useful shortcuts for sets of closely related
syscalls where it usually makes more sense to allow them altogether or
not at all.
2017-09-14 15:45:21 +02:00
Lennart Poettering 0963c053fa seccomp: augment the @resources group a bit
Given that sched_setattr/sched_setparam/sched_setscheduler are already
in the group the closely related nice + ioprio_set should also be
included.

Also, order things alphabetically.
2017-09-14 15:45:21 +02:00
Lennart Poettering b887d2ebfe seccomp: beef up @process group a bit
Include the waid syscalls. If we permit forking then we should also
permit waiting for a process.

Similar to that: also permit determining the usage counters for
processes.

Include calls to determine process/thread identity. They have little
impact security-wise, but are very likely used when process management
of any form is done.

Also, add rt_sigqueueinfo + rt_tgsigqueueinfo as they are similar to
kill() and friends, but permit passing along a userdata ptr.
2017-09-14 15:45:21 +02:00
Lennart Poettering 7e0c3b8fda seccomp: "idle" is another obsolete syscall 2017-09-14 15:45:21 +02:00
Lennart Poettering 215728ff39 seccomp: order the syscalls in more groups alphabetically
No changes besides reordering.
2017-09-14 15:45:21 +02:00
Lennart Poettering ceaa6aa76b seccomp: let's update @file-system a bit
Let's add fremovexattr which was the only xattr syscall so far missing
from the group, even though lremovexattr and friends where included.

Add inotify_init, which is an older (but still supported) version of
inotify_init1.

Add oldfstat, oldlstat, oldstat which are old versions of the stat
syscalls on some archs.

Add utime, which is an older more limited version of utimes and
utimensat.

Enclose the "statx" entry in some ifdeffery to ensure libseccomp
actually knows the syscall. If libseccomp doesn't know it, then we'd get
EINVAL rather than EDOM (which is what is returned if a syscall is known
but not available on the local system) when resolving the syscall name
and we really don't want that, as we use the EDOM vs. EINVAL check for
determining whether a syscall makes sense at all.

Also, order things alphabetically.
2017-09-14 15:45:21 +02:00
Lennart Poettering 648a0ed0d7 seccomp: let's update base-io a bit
Let's add _llseek which is the syscall name on some archs that on others
is simply lseek (due to 64bit vs 32bit off_t confusion). Also, let's
sort things alphabetically.
2017-09-14 15:45:21 +02:00
Lennart Poettering e41b0f42a8 seccomp: update "@default" seccomp group a bit
Let's add more of the most basic operations to "@default" as absolute
baseline needed by glibc and such to operate. Specifically:

futex, get_robust_list, get_thread_area, membarrier, set_robust_list,
set_thread_area, set_tid_address are all required to properly implement
mutexes and other thread synchronization logic. Given that a ton of
datastructures are protected by mutexes (such as stdio and such), let's
just whitelist this by default, so that things can just work.

restart_syscall is used to implement EAGAIN SA_RESTART stuff in some
archs, and synthesized by the kernel without any explicit user logic,
hence let's make this work out of the box.
2017-09-14 15:45:21 +02:00
Lennart Poettering 960e4569e1 nspawn: implement configurable syscall whitelisting/blacklisting
Now that we have ported nspawn's seccomp code to the generic code in
seccomp-util, let's extend it to support whitelisting and blacklisting
of specific additional syscalls.

This uses similar syntax as PID1's support for system call filtering,
but in contrast to that always implements a blacklist (and not a
whitelist), as we prepopulate the filter with a blacklist, and the
unit's system call filter logic does not come with anything
prepopulated.

(Later on we might actually want to invert the logic here, and
whitelist rather than blacklist things, but at this point let's not do
that. In case we switch this over later, the syscall add/remove logic of
this commit should be compatible conceptually.)

Fixes: #5163

Replaces: #5944
2017-09-12 14:06:21 +02:00
Lennart Poettering 69b1b241bb seccomp: split out inner loop code of seccomp_add_syscall_filter_set()
Let's add a new helper function seccomp_add_syscall_filter_item() that
contains the inner loop code of seccomp_add_syscall_filter_set(). This
helper function we can then export and make use of elsewhere.
2017-09-11 18:00:07 +02:00
Lennart Poettering 12dc378902 seccomp: drop default_action parameter from seccomp_add_syscall_filter_set()
The function doesn't actually use the parameter, hence let's drop it.
2017-09-11 18:00:07 +02:00
Lennart Poettering a4135a742d shared: add statx(2) to @file-system syscall filter list (#6738) 2017-09-04 15:35:35 +02:00
Lennart Poettering 72eafe7159 seccomp: rework seccomp_lock_personality() to apply filter to all archs 2017-08-29 15:58:13 +02:00
Topi Miettinen 78e864e5b3 seccomp: LockPersonality boolean (#6193)
Add LockPersonality boolean to allow locking down personality(2)
system call so that the execution domain can't be changed.
This may be useful to improve security because odd emulations
may be poorly tested and source of vulnerabilities, while
system services shouldn't need any weird personalities.
2017-08-29 15:54:50 +02:00
Lennart Poettering 165a31c0db core: add two new special ExecStart= character prefixes
This patch adds two new special character prefixes to ExecStart= and
friends, in addition to the existing "-", "@" and "+":

"!"  → much like "+", except with a much reduced effect as it only
       disables the actual setresuid()/setresgid()/setgroups() calls, but
       leaves all other security features on, including namespace
       options. This is very useful in combination with
       RuntimeDirectory= or DynamicUser= and similar option, as a user
       is still allocated and used for the runtime directory, but the
       actual UID/GID dropping is left to the daemon process itself.
       This should make RuntimeDirectory= a lot more useful for daemons
       which insist on doing their own privilege dropping.

"!!" → Similar to "!", but on systems supporting ambient caps this
       becomes a NOP. This makes it relatively straightforward to write
       unit files that make use of ambient capabilities to let systemd
       drop all privs while retaining compatibility with systems that
       lack ambient caps, where priv dropping is the left to the daemon
       codes themselves.

This is an alternative approach to #6564 and related PRs.
2017-08-10 15:04:32 +02:00
Lennart Poettering 6eaaeee93a seccomp: add new @setuid seccomp group
This new group lists all UID/GID credential changing syscalls (which are
quite a number these days). This will become particularly useful in a
later commit, which uses this group to optionally permit user credential
changing to daemons in case ambient capabilities are not available.
2017-08-10 15:02:50 +02:00
Yu Watanabe b16bd5350f seccomp-util: add parse_syscall_archs() 2017-08-07 23:41:52 +09:00
Zbigniew Jędrzejewski-Szmek 79873bc850 seccomp: arm64 does not have mmap2
I messed up when adding the definitions in 4278d1f531.
Unfortunately I didn't have the hardware at hand and went by
looking at the kernel headers.

(cherry picked from commit 53196fafcb7b24b45ed4f48ab894d00a24a6d871)
2017-07-15 17:18:22 -04:00
Zbigniew Jędrzejewski-Szmek 2e64e8f46d seccomp: arm64/x32 do not have _sysctl
So don't even try to added the filter to reduce noise.
The test is updated to skip calling _sysctl because the kernel prints
an oops-like message that is confusing and unhelpful:

Jul 15 21:07:01 rpi3 kernel: test-seccomp[8448]: syscall -10080
Jul 15 21:07:01 rpi3 kernel: Code: aa0503e4 aa0603e5 aa0703e6 d4000001 (b13ffc1f)
Jul 15 21:07:01 rpi3 kernel: CPU: 3 PID: 8448 Comm: test-seccomp Tainted: G        W       4.11.8-300.fc26.aarch64 #1
Jul 15 21:07:01 rpi3 kernel: Hardware name: raspberrypi rpi/rpi, BIOS 2017.05 06/24/2017
Jul 15 21:07:01 rpi3 kernel: task: ffff80002bb0bb00 task.stack: ffff800036354000
Jul 15 21:07:01 rpi3 kernel: PC is at 0xffff8669c7c4
Jul 15 21:07:01 rpi3 kernel: LR is at 0xaaaac64b6750
Jul 15 21:07:01 rpi3 kernel: pc : [<0000ffff8669c7c4>] lr : [<0000aaaac64b6750>] pstate: 60000000
Jul 15 21:07:01 rpi3 kernel: sp : 0000ffffdc640fd0
Jul 15 21:07:01 rpi3 kernel: x29: 0000ffffdc640fd0 x28: 0000000000000000
Jul 15 21:07:01 rpi3 kernel: x27: 0000000000000000 x26: 0000000000000000
Jul 15 21:07:01 rpi3 kernel: x25: 0000000000000000 x24: 0000000000000000
Jul 15 21:07:01 rpi3 kernel: x23: 0000000000000000 x22: 0000000000000000
Jul 15 21:07:01 rpi3 kernel: x21: 0000aaaac64b4940 x20: 0000000000000000
Jul 15 21:07:01 rpi3 kernel: x19: 0000aaaac64b88f8 x18: 0000000000000020
Jul 15 21:07:01 rpi3 kernel: x17: 0000ffff8669c7a0 x16: 0000aaaac64d2ee0
Jul 15 21:07:01 rpi3 kernel: x15: 0000000000000000 x14: 0000000000000000
Jul 15 21:07:01 rpi3 kernel: x13: 203a657275746365 x12: 0000000000000000
Jul 15 21:07:01 rpi3 kernel: x11: 0000ffffdc640418 x10: 0000000000000000
Jul 15 21:07:01 rpi3 kernel: x9 : 0000000000000005 x8 : 00000000ffffd8a0
Jul 15 21:07:01 rpi3 kernel: x7 : 7f7f7f7f7f7f7f7f x6 : 7f7f7f7f7f7f7f7f
Jul 15 21:07:01 rpi3 kernel: x5 : 65736d68716f7277 x4 : 0000000000000000
Jul 15 21:07:01 rpi3 kernel: x3 : 0000000000000008 x2 : 0000000000000000
Jul 15 21:07:01 rpi3 kernel: x1 : 0000000000000000 x0 : 0000000000000000
Jul 15 21:07:01 rpi3 kernel:

(cherry picked from commit 1e20e640132c700c23494bb9e2619afb83878380)
2017-07-15 17:18:22 -04:00
Zbigniew Jędrzejewski-Szmek e7854c46be shared/seccomp-util: add parentheses and no. after syscall name
"Failed to add rule for system call access, ignoring: Numerical argument out of domain"
is confusing. Make that "... system call access() / 238".

(cherry picked from commit 977dc6ca5acb8069a2966ec63e7378576bc2ca51)
2017-07-15 17:18:22 -04:00
Zbigniew Jędrzejewski-Szmek da1921a5c3 seccomp: enable RestrictAddressFamilies on ppc64, autodetect SECCOMP_RESTRICT_ADDRESS_FAMILIES_BROKEN
We expect that if socket() syscall is available, seccomp works for that
architecture.  So instead of explicitly listing all architectures where we know
it is not available, just assume it is broken if the number is not defined.
This should have the same effect, except that other architectures where it is
also broken will pass tests without further changes. (Architectures where the
filter should work, but does not work because of missing entries in
seccomp-util.c, will still fail.)

i386, s390, s390x are the exception — setting the filter fails, even though
socket() is available, so it needs to be special-cased
(https://github.com/systemd/systemd/issues/5215#issuecomment-277241488).

This remove the last define in seccomp-util.h that was only used in test-seccomp.c. Porting
the seccomp filter to new architectures should be simpler because now only two places need
to be modified.

RestrictAddressFamilies seems to work on ppc64[bl]e, so enable it (the tests pass).
2017-05-10 09:21:16 -04:00
Zbigniew Jędrzejewski-Szmek 511ceb1f8d seccomp: assume clone() arg order is known on all architectures
While adding the defines for arm, I realized that we have pretty much all
known architectures covered, so SECCOMP_RESTRICT_NAMESPACES_BROKEN is not
necessary anymore. clone(2) is adamant that the order of the first two
arguments is only reversed on s390/s390x. So let's simplify things and remove
the #if.
2017-05-07 20:01:04 -04:00
Zbigniew Jędrzejewski-Szmek 4278d1f531 seccomp: add mmap/shmat defines for arm and arm64 2017-05-07 20:01:04 -04:00
Zbigniew Jędrzejewski-Szmek 2a8d6e6395 seccomp: add mmap/shmat defines for ppc64 2017-05-07 20:01:04 -04:00
Zbigniew Jędrzejewski-Szmek 6dc666886a seccomp: factor out seccomp_rule_add_exact to a helper function 2017-05-07 19:01:11 -04:00
James Cowgill a3645cc6dd seccomp: add clone syscall definitions for mips (#5880)
Also updates the documentation and adds a mention of ppc64 support
which was enabled by #5325.

Tested on Debian mipsel and mips64el. The other 4 mips architectures
should have an identical user <-> kernel ABI to one of the 2 tested
systems.
2017-05-03 18:35:45 +02:00
Zbigniew Jędrzejewski-Szmek 290f0ff9aa Define clone order on ppc (#5325)
This was tested on ppc64le. Assume the same is true for ppc64.
2017-02-14 11:27:40 +01:00
Lennart Poettering 9606bc4b4b seccomp: disable RestrictAddressFamilies= for the ABI we shall block, not the one we are compiled for (#5272)
It's a difference. Not a big one, but let's be correct here.
2017-02-12 15:25:40 -05:00
Lennart Poettering f2d9751c59 seccomp: order seccomp ABI list, so that our native ABI comes last (#5306)
this way, we can still call seccomp ourselves, even if seccomp() is
blocked by the filter we are installing.

Fixes: #5300
2017-02-10 23:47:50 +01:00
Lennart Poettering 7961116e2c seccomp: add forgotten munmap() syscall to @file-system (#5291)
We added mmap() and mmap2(), but forgot munmap(). Fix that.

Pointed out by @lucaswerkmeister:

https://github.com/systemd/systemd/pull/4537#issuecomment-273275298
2017-02-09 21:29:33 -05:00
Lennart Poettering ae9d60ce4e seccomp: on s390 the clone() parameters are reversed
Add a bit of code that tries to get the right parameter order in place
for some of the better known architectures, and skips
restrict_namespaces for other archs.

This also bypasses the test on archs where we don't know the right
order.

In this case I didn't bother with testing the case where no filter is
applied, since that is hopefully just an issue for now, as there's
nothing stopping us from supporting more archs, we just need to know
which order is right.

Fixes: #5241
2017-02-08 22:21:27 +01:00
Lennart Poettering 8a50cf6957 seccomp: MemoryDenyWriteExecute= should affect both mmap() and mmap2() (#5254)
On i386 we block the old mmap() call entirely, since we cannot properly
filter it. Thankfully it hasn't been used by glibc since quite some
time.

Fixes: #5240
2017-02-08 15:14:02 +01:00
Lennart Poettering ad8f1479b4 seccomp: RestrictAddressFamilies= is not supported on i386/s390/s390x, make it a NOP
See: #5215
2017-02-06 14:17:12 +01:00
Evgeny Vereshchagin 1b52793d5d seccomp: don't ever try to add an ABI before removing the default native ABI (#5230)
https://github.com/systemd/systemd/issues/5215#issuecomment-277156262

libseccomp does not allow you to add architectures to a filter that
doesn't match the byte ordering of the architectures already added to
the filter (it would be a mess, not to mention largely pointless) and
since systemd attempts to add an ABI before removing the default native
ABI, you will always fail on Power (either due to ppc or ppc64le). The
fix is to remove the native ABI before adding a new ABI so you don't run
into problems with byte ordering.

You would likely see the same failure on a MIPS system.

Thanks @pcmoore!
2017-02-05 11:58:19 -05:00
Lennart Poettering 4d5bd50ab2 seccomp: minor simplifications for is_seccomp_available() 2017-01-17 22:14:27 -05:00
Lennart Poettering 469830d142 seccomp: rework seccomp code, to improve compat with some archs
This substantially reworks the seccomp code, to ensure better
compatibility with some architectures, including i386.

So far we relied on libseccomp's internal handling of the multiple
syscall ABIs supported on Linux. This is problematic however, as it does
not define clear semantics if an ABI is not able to support specific
seccomp rules we install.

This rework hence changes a couple of things:

- We no longer use seccomp_rule_add(), but only
  seccomp_rule_add_exact(), and fail the installation of a filter if the
  architecture doesn't support it.

- We no longer rely on adding multiple syscall architectures to a single filter,
  but instead install a separate filter for each syscall architecture
  supported. This way, we can install a strict filter for x86-64, while
  permitting a less strict filter for i386.

- All high-level filter additions are now moved from execute.c to
  seccomp-util.c, so that we can test them independently of the service
  execution logic.

- Tests have been added for all types of our seccomp filters.

- SystemCallFilters= and SystemCallArchitectures= are now implemented in
  independent filters and installation logic, as they semantically are
  very much independent of each other.

Fixes: #4575
2017-01-17 22:14:27 -05:00
Lennart Poettering 802fa07a4a seccomp: move bdflush() system call to @obsolete filter group
The system call is obsolete after all.
2016-12-27 18:09:37 +01:00
Lennart Poettering 58a8f68be0 seccomp: add proper help string for @resources seccomp filter set 2016-12-27 18:09:37 +01:00
Lennart Poettering bd2ab3f4f6 seccomp: add two new filter sets: @reboot and @swap
These groupe reboot()/kexec() and swapon()/swapoff() respectively
2016-12-27 18:09:37 +01:00
Lennart Poettering 1a1b13c957 seccomp: add @filesystem syscall group (#4537)
@filesystem groups various file system operations, such as opening files and
directories for read/write and stat()ing them, plus renaming, deleting,
symlinking, hardlinking.
2016-11-21 19:29:12 -05:00
Lennart Poettering add005357d core: add new RestrictNamespaces= unit file setting
This new setting permits restricting whether namespaces may be created and
managed by processes started by a unit. It installs a seccomp filter blocking
certain invocations of unshare(), clone() and setns().

RestrictNamespaces=no is the default, and does not restrict namespaces in any
way. RestrictNamespaces=yes takes away the ability to create or manage any kind
of namspace. "RestrictNamespaces=mnt ipc" restricts the creation of namespaces
so that only mount and IPC namespaces may be created/managed, but no other
kind of namespaces.

This setting should be improve security quite a bit as in particular user
namespacing was a major source of CVEs in the kernel in the past, and is
accessible to unprivileged processes. With this setting the entire attack
surface may be removed for system services that do not make use of namespaces.
2016-11-04 07:40:13 -06:00
Zbigniew Jędrzejewski-Szmek d5efc18b60 seccomp-util, analyze: export comments as a help string
Just to make the whole thing easier for users.
2016-11-03 09:35:36 -04:00
Zbigniew Jędrzejewski-Szmek 40eb6a8014 seccomp-util: move @default to the first position
Now that the list is user-visible, @default should be first.
2016-11-03 09:35:36 -04:00
Lennart Poettering 133ddbbeae seccomp: add two new syscall groups
@resources contains various syscalls that alter resource limits and memory and
scheduling parameters of processes. As such they are good candidates to block
for most services.

@basic-io contains a number of basic syscalls for I/O, similar to the list
seccomp v1 permitted but slightly more complete. It should be useful for
building basic whitelisting for minimal sandboxes
2016-11-02 08:50:00 -06:00
Lennart Poettering cd5bfd7e60 seccomp: include pipes and memfd in @ipc
These system calls clearly fall in the @ipc category, hence should be listed
there, simply to avoid confusion and surprise by the user.
2016-11-02 08:50:00 -06:00
Lennart Poettering a8c157ff30 seccomp: drop execve() from @process list
The system call is already part in @default hence implicitly allowed anyway.
Also, if it is actually blocked then systemd couldn't execute the service in
question anymore, since the application of seccomp is immediately followed by
it.
2016-11-02 08:49:59 -06:00