Commit Graph

25 Commits

Author SHA1 Message Date
Lennart Poettering add005357d core: add new RestrictNamespaces= unit file setting
This new setting permits restricting whether namespaces may be created and
managed by processes started by a unit. It installs a seccomp filter blocking
certain invocations of unshare(), clone() and setns().

RestrictNamespaces=no is the default, and does not restrict namespaces in any
way. RestrictNamespaces=yes takes away the ability to create or manage any kind
of namspace. "RestrictNamespaces=mnt ipc" restricts the creation of namespaces
so that only mount and IPC namespaces may be created/managed, but no other
kind of namespaces.

This setting should be improve security quite a bit as in particular user
namespacing was a major source of CVEs in the kernel in the past, and is
accessible to unprivileged processes. With this setting the entire attack
surface may be removed for system services that do not make use of namespaces.
2016-11-04 07:40:13 -06:00
Zbigniew Jędrzejewski-Szmek d5efc18b60 seccomp-util, analyze: export comments as a help string
Just to make the whole thing easier for users.
2016-11-03 09:35:36 -04:00
Zbigniew Jędrzejewski-Szmek 40eb6a8014 seccomp-util: move @default to the first position
Now that the list is user-visible, @default should be first.
2016-11-03 09:35:36 -04:00
Lennart Poettering 133ddbbeae seccomp: add two new syscall groups
@resources contains various syscalls that alter resource limits and memory and
scheduling parameters of processes. As such they are good candidates to block
for most services.

@basic-io contains a number of basic syscalls for I/O, similar to the list
seccomp v1 permitted but slightly more complete. It should be useful for
building basic whitelisting for minimal sandboxes
2016-11-02 08:50:00 -06:00
Lennart Poettering cd5bfd7e60 seccomp: include pipes and memfd in @ipc
These system calls clearly fall in the @ipc category, hence should be listed
there, simply to avoid confusion and surprise by the user.
2016-11-02 08:50:00 -06:00
Lennart Poettering a8c157ff30 seccomp: drop execve() from @process list
The system call is already part in @default hence implicitly allowed anyway.
Also, if it is actually blocked then systemd couldn't execute the service in
question anymore, since the application of seccomp is immediately followed by
it.
2016-11-02 08:49:59 -06:00
Lennart Poettering c79aff9a82 seccomp: add clock query and sleeping syscalls to "@default" group
Timing and sleep are so basic operations, it makes very little sense to ever
block them, hence don't.
2016-11-02 08:49:59 -06:00
Zbigniew Jędrzejewski-Szmek aa34055ffb seccomp: allow specifying arm64, mips, ppc (#4491)
"Secondary arch" table for mips is entirely speculative…
2016-11-01 09:33:18 -06:00
Lennart Poettering a3be2849b2 seccomp: add new helper call seccomp_load_filter_set()
This allows us to unify most of the code in apply_protect_kernel_modules() and
apply_private_devices().
2016-10-24 17:32:50 +02:00
Lennart Poettering 60f547cf68 seccomp: two fixes for the syscall set tables
"oldumount()" is not a syscall, but simply a wrapper for it, the actual syscall
nr is called "umount" (and the nr of umount() is called umount2 internally).

"sysctl()" is not a syscall, but "_syscall()" is. Fix this in the table.

Without these changes libseccomp cannot actually translate the tables in full.
This wasn't noticed before as the code was written defensively for this case.
2016-10-24 17:32:50 +02:00
Lennart Poettering 8d7b0c8fd7 seccomp: add new seccomp_init_conservative() helper
This adds a new seccomp_init_conservative() helper call that is mostly just a
wrapper around seccomp_init(), but turns off NNP and adds in all secondary
archs, for best compatibility with everything else.

Pretty much all of our code used the very same constructs for these three
steps, hence unifying this in one small function makes things a lot shorter.

This also changes incorrect usage of the "scmp_filter_ctx" type at various
places. libseccomp defines it as typedef to "void*", i.e. it is a pointer type
(pretty poor choice already!) that casts implicitly to and from all other
pointer types (even poorer choice: you defined a confusing type now, and don't
even gain any bit of type safety through it...). A lot of the code assumed the
type would refer to a structure, and hence aded additional "*" here and there.
Remove that.
2016-10-24 17:32:50 +02:00
Lennart Poettering 8130926d32 core: rework syscall filter set handling
A variety of fixes:

- rename the SystemCallFilterSet structure to SyscallFilterSet. So far the main
  instance of it (the syscall_filter_sets[] array) used to abbreviate
  "SystemCall" as "Syscall". Let's stick to one of the two syntaxes, and not
  mix and match too wildly. Let's pick the shorter name in this case, as it is
  sufficiently well established to not confuse hackers reading this.

- Export explicit indexes into the syscall_filter_sets[] array via an enum.
  This way, code that wants to make use of a specific filter set, can index it
  directly via the enum, instead of having to search for it. This makes
  apply_private_devices() in particular a lot simpler.

- Provide two new helper calls in seccomp-util.c: syscall_filter_set_find() to
  find a set by its name, seccomp_add_syscall_filter_set() to add a set to a
  seccomp object.

- Update SystemCallFilter= parser to use extract_first_word().  Let's work on
  deprecating FOREACH_WORD_QUOTED().

- Simplify apply_private_devices() using this functionality
2016-10-24 17:32:50 +02:00
hbrueckner 6abfd30372 seccomp: add support for the s390 architecture (#4287)
Add seccomp support for the s390 architecture (31-bit and 64-bit)
to systemd.

This requires libseccomp >= 2.3.1.
2016-10-05 13:58:55 +02:00
Felipe Sateler d347d9029c seccomp: also detect if seccomp filtering is enabled
In https://github.com/systemd/systemd/pull/4004 , a runtime detection
method for seccomp was added. However, it does not detect the case
where CONFIG_SECCOMP=y but CONFIG_SECCOMP_FILTER=n. This is possible
if the architecture does not support filtering yet.
Add a check for that case too.

While at it, change get_proc_field usage to use PR_GET_SECCOMP prctl,
as that should save a few system calls and (unnecessary) allocations.
Previously, reading of /proc/self/stat was done as recommended by
prctl(2) as safer. However, given that we need to do the prctl call
anyway, lets skip opening, reading and parsing the file.

Code for checking inspired by
https://outflux.net/teach-seccomp/autodetect.html
2016-09-06 20:25:49 -03:00
Evgeny Vereshchagin 6afe14ff5b Merge pull request #3984 from poettering/refcnt
permit bus clients to pin units to avoid automatic GC
2016-08-26 16:17:05 +03:00
Felipe Sateler 83f12b27d1 core: do not fail at step SECCOMP if there is no kernel support (#4004)
Fixes #3882
2016-08-22 22:40:58 +03:00
Lennart Poettering 4a4485ae69 seccomp: make sure getrlimit() is among the default permitted syscalls
A lot of basic code wants to know the stack size, and it is safe if they do,
hence let's permit getrlimit() (but not setrlimit()) by default.

See: #3970
2016-08-22 14:17:23 +02:00
Lennart Poettering 1f9ac68b5b core: improve seccomp syscall grouping a bit
This adds three new seccomp syscall groups: @keyring for kernel keyring access,
@cpu-emulation for CPU emulation features, for exampe vm86() for dosemu and
suchlike, and @debug for ptrace() and related calls.

Also, the @clock group is updated with more syscalls that alter the system
clock. capset() is added to @privileged, and pciconfig_iobase() is added to
@raw-io.

Finally, @obsolete is a cleaned up. A number of syscalls that never existed on
Linux and have no number assigned on any architecture are removed, as they only
exist in the man pages and other operating sytems, but not in code at all.
create_module() is moved from @module to @obsolete, as it is an obsolete system
call. mem_getpolicy() is removed from the @obsolete list, as it is not
obsolete, but simply a NUMA API.
2016-06-13 16:25:54 +02:00
Topi Miettinen 201c1cc22a core: add pre-defined syscall groups to SystemCallFilter= (#3053) (#3157)
Implement sets of system calls to help constructing system call
filters. A set starts with '@' to distinguish from a system call.

Closes: #3053, #3157
2016-06-01 11:56:01 +02:00
Daniel Mack b26fa1a2fb tree-wide: remove Emacs lines from all files
This should be handled fine now by .dir-locals.el, so need to carry that
stuff in every file.
2016-02-10 13:41:57 +01:00
Thomas Hindoe Paaboel Andersen a8fbdf5424 shared: include what we use
The next step of a general cleanup of our includes. This one mostly
adds missing includes but there are a few removals as well.
2015-12-06 13:49:33 +01:00
Thomas Hindoe Paaboel Andersen cf0fbc49e6 tree-wide: sort includes
Sort the includes accoding to the new coding style.
2015-11-16 22:09:36 +01:00
Lennart Poettering 07630cea1f util-lib: split our string related calls from util.[ch] into its own file string-util.[ch]
There are more than enough calls doing string manipulations to deserve
its own files, hence do something about it.

This patch also sorts the #include blocks of all files that needed to be
updated, according to the sorting suggestions from CODING_STYLE. Since
pretty much every file needs our string manipulation functions this
effectively means that most files have sorted #include blocks now.

Also touches a few unrelated include files.
2015-10-24 23:05:02 +02:00
Lennart Poettering e9642be2cc seccomp: add helper call to add all secondary archs to a seccomp filter
And make use of it where appropriate for executing services and for
nspawn.
2014-02-18 22:14:00 +01:00
Lennart Poettering 57183d117a core: add SystemCallArchitectures= unit setting to allow disabling of non-native
architecture support for system calls

Also, turn system call filter bus properties into complex types instead
of concatenated strings.
2014-02-13 00:24:00 +01:00