Commit graph

26 commits

Author SHA1 Message Date
Zbigniew Jędrzejewski-Szmek da1921a5c3 seccomp: enable RestrictAddressFamilies on ppc64, autodetect SECCOMP_RESTRICT_ADDRESS_FAMILIES_BROKEN
We expect that if socket() syscall is available, seccomp works for that
architecture.  So instead of explicitly listing all architectures where we know
it is not available, just assume it is broken if the number is not defined.
This should have the same effect, except that other architectures where it is
also broken will pass tests without further changes. (Architectures where the
filter should work, but does not work because of missing entries in
seccomp-util.c, will still fail.)

i386, s390, s390x are the exception — setting the filter fails, even though
socket() is available, so it needs to be special-cased
(https://github.com/systemd/systemd/issues/5215#issuecomment-277241488).

This remove the last define in seccomp-util.h that was only used in test-seccomp.c. Porting
the seccomp filter to new architectures should be simpler because now only two places need
to be modified.

RestrictAddressFamilies seems to work on ppc64[bl]e, so enable it (the tests pass).
2017-05-10 09:21:16 -04:00
Zbigniew Jędrzejewski-Szmek 511ceb1f8d seccomp: assume clone() arg order is known on all architectures
While adding the defines for arm, I realized that we have pretty much all
known architectures covered, so SECCOMP_RESTRICT_NAMESPACES_BROKEN is not
necessary anymore. clone(2) is adamant that the order of the first two
arguments is only reversed on s390/s390x. So let's simplify things and remove
the #if.
2017-05-07 20:01:04 -04:00
Zbigniew Jędrzejewski-Szmek 2a65bd94e4 seccomp: drop SECCOMP_MEMORY_DENY_WRITE_EXECUTE_BROKEN, add test for shmat
SECCOMP_MEMORY_DENY_WRITE_EXECUTE_BROKEN was conflating two separate things:
1. whether shmat/shmdt/shmget can be filtered (if ipc multiplexer is used, they can not)
2. whether we know this for the current architecture

For i386, shmat is implemented as ipc, so seccomp filter is "broken" for shmat,
but not for mmap, and SECCOMP_MEMORY_DENY_WRITE_EXECUTE_BROKEN cannot be used
to cover both cases. The define was only used for tests — not in the implementation
in seccomp-util.c. So let's get rid of SECCOMP_MEMORY_DENY_WRITE_EXECUTE_BROKEN
and encode the right condition directly in tests.
2017-05-07 18:59:37 -04:00
James Cowgill a3645cc6dd seccomp: add clone syscall definitions for mips (#5880)
Also updates the documentation and adds a mention of ppc64 support
which was enabled by #5325.

Tested on Debian mipsel and mips64el. The other 4 mips architectures
should have an identical user <-> kernel ABI to one of the 2 tested
systems.
2017-05-03 18:35:45 +02:00
Zbigniew Jędrzejewski-Szmek 290f0ff9aa Define clone order on ppc (#5325)
This was tested on ppc64le. Assume the same is true for ppc64.
2017-02-14 11:27:40 +01:00
Lennart Poettering ae9d60ce4e seccomp: on s390 the clone() parameters are reversed
Add a bit of code that tries to get the right parameter order in place
for some of the better known architectures, and skips
restrict_namespaces for other archs.

This also bypasses the test on archs where we don't know the right
order.

In this case I didn't bother with testing the case where no filter is
applied, since that is hopefully just an issue for now, as there's
nothing stopping us from supporting more archs, we just need to know
which order is right.

Fixes: #5241
2017-02-08 22:21:27 +01:00
Lennart Poettering 8a50cf6957 seccomp: MemoryDenyWriteExecute= should affect both mmap() and mmap2() (#5254)
On i386 we block the old mmap() call entirely, since we cannot properly
filter it. Thankfully it hasn't been used by glibc since quite some
time.

Fixes: #5240
2017-02-08 15:14:02 +01:00
Lennart Poettering ad8f1479b4 seccomp: RestrictAddressFamilies= is not supported on i386/s390/s390x, make it a NOP
See: #5215
2017-02-06 14:17:12 +01:00
Lennart Poettering 469830d142 seccomp: rework seccomp code, to improve compat with some archs
This substantially reworks the seccomp code, to ensure better
compatibility with some architectures, including i386.

So far we relied on libseccomp's internal handling of the multiple
syscall ABIs supported on Linux. This is problematic however, as it does
not define clear semantics if an ABI is not able to support specific
seccomp rules we install.

This rework hence changes a couple of things:

- We no longer use seccomp_rule_add(), but only
  seccomp_rule_add_exact(), and fail the installation of a filter if the
  architecture doesn't support it.

- We no longer rely on adding multiple syscall architectures to a single filter,
  but instead install a separate filter for each syscall architecture
  supported. This way, we can install a strict filter for x86-64, while
  permitting a less strict filter for i386.

- All high-level filter additions are now moved from execute.c to
  seccomp-util.c, so that we can test them independently of the service
  execution logic.

- Tests have been added for all types of our seccomp filters.

- SystemCallFilters= and SystemCallArchitectures= are now implemented in
  independent filters and installation logic, as they semantically are
  very much independent of each other.

Fixes: #4575
2017-01-17 22:14:27 -05:00
Lennart Poettering bd2ab3f4f6 seccomp: add two new filter sets: @reboot and @swap
These groupe reboot()/kexec() and swapon()/swapoff() respectively
2016-12-27 18:09:37 +01:00
Lennart Poettering 1a1b13c957 seccomp: add @filesystem syscall group (#4537)
@filesystem groups various file system operations, such as opening files and
directories for read/write and stat()ing them, plus renaming, deleting,
symlinking, hardlinking.
2016-11-21 19:29:12 -05:00
Lennart Poettering add005357d core: add new RestrictNamespaces= unit file setting
This new setting permits restricting whether namespaces may be created and
managed by processes started by a unit. It installs a seccomp filter blocking
certain invocations of unshare(), clone() and setns().

RestrictNamespaces=no is the default, and does not restrict namespaces in any
way. RestrictNamespaces=yes takes away the ability to create or manage any kind
of namspace. "RestrictNamespaces=mnt ipc" restricts the creation of namespaces
so that only mount and IPC namespaces may be created/managed, but no other
kind of namespaces.

This setting should be improve security quite a bit as in particular user
namespacing was a major source of CVEs in the kernel in the past, and is
accessible to unprivileged processes. With this setting the entire attack
surface may be removed for system services that do not make use of namespaces.
2016-11-04 07:40:13 -06:00
Zbigniew Jędrzejewski-Szmek d5efc18b60 seccomp-util, analyze: export comments as a help string
Just to make the whole thing easier for users.
2016-11-03 09:35:36 -04:00
Zbigniew Jędrzejewski-Szmek 40eb6a8014 seccomp-util: move @default to the first position
Now that the list is user-visible, @default should be first.
2016-11-03 09:35:36 -04:00
Lennart Poettering 133ddbbeae seccomp: add two new syscall groups
@resources contains various syscalls that alter resource limits and memory and
scheduling parameters of processes. As such they are good candidates to block
for most services.

@basic-io contains a number of basic syscalls for I/O, similar to the list
seccomp v1 permitted but slightly more complete. It should be useful for
building basic whitelisting for minimal sandboxes
2016-11-02 08:50:00 -06:00
Lennart Poettering f6281133de seccomp: add test-seccomp test tool
This validates the system call set table and many of our seccomp-util.c APIs.
2016-10-24 17:32:51 +02:00
Lennart Poettering a3be2849b2 seccomp: add new helper call seccomp_load_filter_set()
This allows us to unify most of the code in apply_protect_kernel_modules() and
apply_private_devices().
2016-10-24 17:32:50 +02:00
Lennart Poettering 8d7b0c8fd7 seccomp: add new seccomp_init_conservative() helper
This adds a new seccomp_init_conservative() helper call that is mostly just a
wrapper around seccomp_init(), but turns off NNP and adds in all secondary
archs, for best compatibility with everything else.

Pretty much all of our code used the very same constructs for these three
steps, hence unifying this in one small function makes things a lot shorter.

This also changes incorrect usage of the "scmp_filter_ctx" type at various
places. libseccomp defines it as typedef to "void*", i.e. it is a pointer type
(pretty poor choice already!) that casts implicitly to and from all other
pointer types (even poorer choice: you defined a confusing type now, and don't
even gain any bit of type safety through it...). A lot of the code assumed the
type would refer to a structure, and hence aded additional "*" here and there.
Remove that.
2016-10-24 17:32:50 +02:00
Lennart Poettering 8130926d32 core: rework syscall filter set handling
A variety of fixes:

- rename the SystemCallFilterSet structure to SyscallFilterSet. So far the main
  instance of it (the syscall_filter_sets[] array) used to abbreviate
  "SystemCall" as "Syscall". Let's stick to one of the two syntaxes, and not
  mix and match too wildly. Let's pick the shorter name in this case, as it is
  sufficiently well established to not confuse hackers reading this.

- Export explicit indexes into the syscall_filter_sets[] array via an enum.
  This way, code that wants to make use of a specific filter set, can index it
  directly via the enum, instead of having to search for it. This makes
  apply_private_devices() in particular a lot simpler.

- Provide two new helper calls in seccomp-util.c: syscall_filter_set_find() to
  find a set by its name, seccomp_add_syscall_filter_set() to add a set to a
  seccomp object.

- Update SystemCallFilter= parser to use extract_first_word().  Let's work on
  deprecating FOREACH_WORD_QUOTED().

- Simplify apply_private_devices() using this functionality
2016-10-24 17:32:50 +02:00
Felipe Sateler 83f12b27d1 core: do not fail at step SECCOMP if there is no kernel support (#4004)
Fixes #3882
2016-08-22 22:40:58 +03:00
Topi Miettinen 201c1cc22a core: add pre-defined syscall groups to SystemCallFilter= (#3053) (#3157)
Implement sets of system calls to help constructing system call
filters. A set starts with '@' to distinguish from a system call.

Closes: #3053, #3157
2016-06-01 11:56:01 +02:00
Daniel Mack b26fa1a2fb tree-wide: remove Emacs lines from all files
This should be handled fine now by .dir-locals.el, so need to carry that
stuff in every file.
2016-02-10 13:41:57 +01:00
Thomas Hindoe Paaboel Andersen a8fbdf5424 shared: include what we use
The next step of a general cleanup of our includes. This one mostly
adds missing includes but there are a few removals as well.
2015-12-06 13:49:33 +01:00
Lennart Poettering a60e9f7fc8 seccomp-util.h: make sure seccomp-util.h can be included alone 2014-12-12 13:35:32 +01:00
Lennart Poettering e9642be2cc seccomp: add helper call to add all secondary archs to a seccomp filter
And make use of it where appropriate for executing services and for
nspawn.
2014-02-18 22:14:00 +01:00
Lennart Poettering 57183d117a core: add SystemCallArchitectures= unit setting to allow disabling of non-native
architecture support for system calls

Also, turn system call filter bus properties into complex types instead
of concatenated strings.
2014-02-13 00:24:00 +01:00