Commit graph

266 commits

Author SHA1 Message Date
Lennart Poettering ba128bb809 execute: filter low-level I/O syscalls if PrivateDevices= is set
If device access is restricted via PrivateDevices=, let's also block the
various low-level I/O syscalls at the same time, so that we know that the
minimal set of devices in our virtualized /dev are really everything the unit
can access.
2016-09-25 10:52:57 +02:00
Lennart Poettering 096424d123 execute: drop group priviliges only after setting up namespace
If PrivateDevices=yes is set, the namespace code creates device nodes in /dev
that should be owned by the host's root, hence let's make sure we set up the
namespace before dropping group privileges.
2016-09-25 10:42:18 +02:00
Lennart Poettering 3fbe8dbe41 execute: if RuntimeDirectory= is set, it should be writable
Implicitly make all dirs set with RuntimeDirectory= writable, as the concept
otherwise makes no sense.
2016-09-25 10:19:05 +02:00
Lennart Poettering be39ccf3a0 execute: move suppression of HOME=/ and SHELL=/bin/nologin into user-util.c
This adds a new call get_user_creds_clean(), which is just like
get_user_creds() but returns NULL in the home/shell parameters if they contain
no useful information. This code previously lived in execute.c, but by
generalizing this we can reuse it in run.c.
2016-09-25 10:18:57 +02:00
Lennart Poettering 07689d5d2c execute: split out creation of runtime dirs into its own functions 2016-09-25 10:18:54 +02:00
Lennart Poettering 59eeb84ba6 core: add two new service settings ProtectKernelTunables= and ProtectControlGroups=
If enabled, these will block write access to /sys, /proc/sys and
/proc/sys/fs/cgroup.
2016-09-25 10:18:48 +02:00
Lennart Poettering 72246c2a65 core: enforce seccomp for secondary archs too, for all rules
Let's make sure that all our rules apply to all archs the local kernel
supports.
2016-09-25 10:18:44 +02:00
Felipe Sateler d347d9029c seccomp: also detect if seccomp filtering is enabled
In https://github.com/systemd/systemd/pull/4004 , a runtime detection
method for seccomp was added. However, it does not detect the case
where CONFIG_SECCOMP=y but CONFIG_SECCOMP_FILTER=n. This is possible
if the architecture does not support filtering yet.
Add a check for that case too.

While at it, change get_proc_field usage to use PR_GET_SECCOMP prctl,
as that should save a few system calls and (unnecessary) allocations.
Previously, reading of /proc/self/stat was done as recommended by
prctl(2) as safer. However, given that we need to do the prctl call
anyway, lets skip opening, reading and parsing the file.

Code for checking inspired by
https://outflux.net/teach-seccomp/autodetect.html
2016-09-06 20:25:49 -03:00
Felipe Sateler 83f12b27d1 core: do not fail at step SECCOMP if there is no kernel support (#4004)
Fixes #3882
2016-08-22 22:40:58 +03:00
Lennart Poettering fd63e712b2 core: bypass dynamic user lookups from dbus-daemon
dbus-daemon does NSS name look-ups in order to enforce its bus policy. This
might dead-lock if an NSS module use wants to use D-Bus for the look-up itself,
like our nss-systemd does. Let's work around this by bypassing bus
communication in the NSS module if we run inside of dbus-daemon. To make this
work we keep a bit of extra state in /run/systemd/dynamic-uid/ so that we don't
have to consult the bus, but can still resolve the names.

Note that the normal codepath continues to be via the bus, so that resolving
works from all mount namespaces and is subject to authentication, as before.

This is a bit dirty, but not too dirty, as dbus daemon is kinda special anyway
for PID 1.
2016-08-19 00:50:24 +02:00
Lennart Poettering 00d9ef8560 core: add RemoveIPC= setting
This adds the boolean RemoveIPC= setting to service, socket, mount and swap
units (i.e.  all unit types that may invoke processes). if turned on, and the
unit's user/group is not root, all IPC objects of the user/group are removed
when the service is shut down. The life-cycle of the IPC objects is hence bound
to the unit life-cycle.

This is particularly relevant for units with dynamic users, as it is essential
that no objects owned by the dynamic users survive the service exiting. In
fact, this patch adds code to imply RemoveIPC= if DynamicUser= is set.

In order to communicate the UID/GID of an executed process back to PID 1 this
adds a new "user lookup" socket pair, that is inherited into the forked
processes, and closed before the exec(). This is needed since we cannot do NSS
from PID 1 due to deadlock risks, However need to know the used UID/GID in
order to clean up IPC owned by it if the unit shuts down.
2016-08-19 00:37:25 +02:00
Lennart Poettering 92b25bcabb core: make use of uid_is_valid() when checking for UID validity 2016-08-18 22:49:48 +02:00
Zbigniew Jędrzejewski-Szmek d87a2ef782 Merge pull request #3884 from poettering/private-users 2016-08-06 17:04:45 -04:00
Lennart Poettering b08af3b127 core: only set the watchdog variables in ExecStart= lines 2016-08-04 23:08:05 +02:00
Lennart Poettering af9d16e10a core: use the correct APIs to determine whether a dual timestamp is initialized 2016-08-04 16:27:07 +02:00
Lennart Poettering c39f1ce24d core: turn various execution flags into a proper flags parameter
The ExecParameters structure contains a number of bit-flags, that were so far
exposed as bool:1, change this to a proper, single binary bit flag field. This
makes things a bit more expressive, and is helpful as we add more flags, since
these booleans are passed around in various callers, for example
service_spawn(), whose signature can be made much shorter now.

Not all bit booleans from ExecParameters are moved into the flags field for
now, but this can be added later.
2016-08-04 16:27:07 +02:00
Lennart Poettering d251207d55 core: add new PrivateUsers= option to service execution
This setting adds minimal user namespacing support to a service. When set the invoked
processes will run in their own user namespace. Only a trivial mapping will be
set up: the root user/group is mapped to root, and the user/group of the
service will be mapped to itself, everything else is mapped to nobody.

If this setting is used the service runs with no capabilities on the host, but
configurable capabilities within the service.

This setting is particularly useful in conjunction with RootDirectory= as the
need to synchronize /etc/passwd and /etc/group between the host and the service
OS tree is reduced, as only three UID/GIDs need to match: root, nobody and the
user of the service itself. But even outside the RootDirectory= case this
setting is useful to substantially reduce the attack surface of a service.

Example command to test this:

        systemd-run -p PrivateUsers=1 -p User=foobar -t /bin/sh

This runs a shell as user "foobar". When typing "ps" only processes owned by
"root", by "foobar", and by "nobody" should be visible.
2016-08-03 20:42:04 +02:00
Lennart Poettering 7049382803 execute: don't set $SHELL and $HOME for services, if they don't contain interesting data 2016-08-03 14:52:16 +02:00
Lennart Poettering 6af760f3b2 core: inherit TERM from PID 1 for all services started on /dev/console
This way, invoking nspawn from a shell in the best case inherits the TERM
setting all the way down into the login shell spawned in the container.

Fixes: #3697
2016-08-03 14:52:16 +02:00
Lennart Poettering 409093fe10 nss: add new "nss-systemd" NSS module for mapping dynamic users
With this NSS module all dynamic service users will be resolvable via NSS like
any real user.
2016-07-22 15:53:45 +02:00
Lennart Poettering 29206d4619 core: add a concept of "dynamic" user ids, that are allocated as long as a service is running
This adds a new boolean setting DynamicUser= to service files. If set, a new
user will be allocated dynamically when the unit is started, and released when
it is stopped. The user ID is allocated from the range 61184..65519. The user
will not be added to /etc/passwd (but an NSS module to be added later should
make it show up in getent passwd).

For now, care should be taken that the service writes no files to disk, since
this might result in files owned by UIDs that might get assigned dynamically to
a different service later on. Later patches will tighten sandboxing in order to
ensure that this cannot happen, except for a few selected directories.

A simple way to test this is:

        systemd-run -p DynamicUser=1 /bin/sleep 99999
2016-07-22 15:53:45 +02:00
Lennart Poettering 33df919d5c execute: make sure JoinsNamespaceOf= doesn't leak ns fds to executed processes 2016-07-20 14:53:15 +02:00
Lennart Poettering 7a1ab780c4 execute: normalize connect_logger_as() parameters slightly
All other functions in execute.c that need the unit id take a Unit* parameter
as first argument. Let's change connect_logger_as() to follow a similar logic.
2016-07-20 14:53:15 +02:00
Alessandro Puccetti 2a624c36e6 doc,core: Read{Write,Only}Paths= and InaccessiblePaths=
This patch renames Read{Write,Only}Directories= and InaccessibleDirectories=
to Read{Write,Only}Paths= and InaccessiblePaths=, previous names are kept
as aliases but they are not advertised in the documentation.

Renamed variables:
`read_write_dirs` --> `read_write_paths`
`read_only_dirs` --> `read_only_paths`
`inaccessible_dirs` --> `inaccessible_paths`
2016-07-19 17:22:02 +02:00
Torstein Husebø 61233823aa treewide: fix typos and remove accidental repetition of words 2016-07-11 16:18:43 +02:00
Jouke Witteveen 84eada2f7f execute: Do not alter call-by-ref parameter on failure
Prevent free from being called on (a part of) the call-by-reference
variable env when setup_pam fails.
2016-07-08 09:42:48 +02:00
Jouke Witteveen 1280503b7e execute: Cleanup the environment early
By cleaning up before setting up PAM we maintain control of overriding
behavior in setting variables. Otherwise, pam_putenv is in control.
This also makes sure we use a cleaned up environment in replacing
variables in argv.
2016-07-07 14:15:50 +02:00
Lennart Poettering f4170c671b execute: add a new easy-to-use RestrictRealtime= option to units
It takes a boolean value. If true, access to SCHED_RR, SCHED_FIFO and
SCHED_DEADLINE is blocked, which my be used to lock up the system.
2016-06-23 01:45:45 +02:00
Lennart Poettering abd84d4d83 execute: be a little less drastic when MemoryDenyWriteExecute= hits
Let's politely refuse with EPERM rather than kill the whole thing right-away.
2016-06-23 01:35:04 +02:00
Lennart Poettering 686d9ba614 execute: set PR_SET_NO_NEW_PRIVS also in case the exec memory protection is used
This was forgotten when MemoryDenyWriteExecute= was added: we should set NNP in
all cases when we set seccomp filters.
2016-06-23 01:33:07 +02:00
Lennart Poettering 03857c43ce execute: use the return value of setrlimit_closest() properly
It's a function defined by us, hence we should look for the error in its return
value, not in "errno".
2016-06-23 01:31:24 +02:00
Lennart Poettering 7bce046bcf core: set $JOURNAL_STREAM to the dev_t/ino_t of the journal stream of executed services
This permits services to detect whether their stdout/stderr is connected to the
journal, and if so talk to the journal directly, thus permitting carrying of
metadata.

As requested by the gtk folks: #2473
2016-06-15 23:00:27 +02:00
Lennart Poettering fd1f9c89f7 execute: minor coding style improvements 2016-06-15 22:51:01 +02:00
Jouke Witteveen 2065ca699b core/execute: pass env vars to PAM session setup (#3503)
Move the merger of environment variables before setting up the PAM
session and pass the aggregate environment to PAM setup. This allows
control over the PAM session hooks through environment variables.

PAM session initiation may update the environment. On successful
initiation of a PAM session, we adopt the environment of the
PAM context.
2016-06-13 12:50:12 +02:00
Alessandro Puccetti cf677fe686 core/execute: add the magic character '!' to allow privileged execution (#3493)
This patch implements the new magic character '!'. By putting '!' in front
of a command, systemd executes it with full privileges ignoring paramters
such as User, Group, SupplementaryGroups, CapabilityBoundingSet,
AmbientCapabilities, SecureBits, SystemCallFilter, SELinuxContext,
AppArmorProfile, SmackProcessLabel, and RestrictAddressFamilies.

Fixes partially https://github.com/systemd/systemd/issues/3414
Related to https://github.com/coreos/rkt/issues/2482

Testing:
1. Create a user 'bob'
2. Create the unit file /etc/systemd/system/exec-perm.service
   (You can use the example below)
3. sudo systemctl start ext-perm.service
4. Verify that the commands starting with '!' were not executed as bob,
   4.1 Looking to the output of ls -l /tmp/exec-perm
   4.2 Each file contains the result of the id command.

`````````````````````````````````````````````````````````````````
[Unit]
Description=ext-perm

[Service]
Type=oneshot
TimeoutStartSec=0
User=bob
ExecStartPre=!/usr/bin/sh -c "/usr/bin/rm /tmp/exec-perm*" ;
    /usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-start-pre"
ExecStart=/usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-start" ;
    !/usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-star-2"
ExecStartPost=/usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-start-post"
ExecReload=/usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-reload"
ExecStop=!/usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-stop"
ExecStopPost=/usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-stop-post"

[Install]
WantedBy=multi-user.target]
`````````````````````````````````````````````````````````````````
2016-06-10 18:19:54 +02:00
Lennart Poettering 1ff74fb6e3 execute: check whether the specified fd is a tty before chowning/chmoding it (#3457)
Let's add an extra safety check before we chmod/chown a TTY to the right user,
as we might end up having connected something to STDIN/STDOUT that is actually
not a TTY, even though this might have been requested, due to permissive
StandardInput= settings or transient service activation with fds passed in.

Fixes:

https://bugs.freedesktop.org/show_bug.cgi?id=85255
2016-06-09 10:01:16 +02:00
Topi Miettinen f3e4363593 core: Restrict mmap and mprotect with PAGE_WRITE|PAGE_EXEC (#3319) (#3379)
New exec boolean MemoryDenyWriteExecute, when set, installs
a seccomp filter to reject mmap(2) with PAGE_WRITE|PAGE_EXEC
and mprotect(2) with PAGE_EXEC.
2016-06-03 17:58:18 +02:00
Lennart Poettering fc2fffe770 tree-wide: introduce new SOCKADDR_UN_LEN() macro, and use it everywhere
The macro determines the right length of a AF_UNIX "struct sockaddr_un" to pass to
connect() or bind(). It automatically figures out if the socket refers to an
abstract namespace socket, or a socket in the file system, and properly handles
the full length of the path field.

This macro is not only safer, but also simpler to use, than the usual
offsetof() + strlen() logic.
2016-05-05 22:24:36 +02:00
Daniel Mack 68de79d6a4 Merge pull request #2760 from ronnychevalier/rc/core_no_new_privileges_seccompv3
core: set NoNewPrivileges for seccomp if we don't have CAP_SYS_ADMIN
2016-03-21 12:57:43 +01:00
Ronny Chevalier 19c0b0b9a5 core: set NoNewPrivileges for seccomp if we don't have CAP_SYS_ADMIN
The manpage of seccomp specify that using seccomp with
SECCOMP_SET_MODE_FILTER will return EACCES if the caller do not have
CAP_SYS_ADMIN set, or if the no_new_privileges bit is not set. Hence,
without NoNewPrivilege set, it is impossible to use a SystemCall*
directive with a User directive set in system mode.

Now, NoNewPrivileges is set if we are in user mode, or if we are in
system mode and we don't have CAP_SYS_ADMIN, and SystemCall*
directives are used.
2016-02-28 14:44:26 +01:00
Thomas Hindoe Paaboel Andersen 7f508f2c74 tree-wide: indentation fixes 2016-02-26 22:23:38 +01:00
Vito Caputo 313cefa1d9 tree-wide: make ++/-- usage consistent WRT spacing
Throughout the tree there's spurious use of spaces separating ++ and --
operators from their respective operands.  Make ++ and -- operator
consistent with the majority of existing uses; discard the spaces.
2016-02-22 20:32:04 -08:00
Lennart Poettering 479050b363 core: drop Capabilities= setting
The setting is hardly useful (since its effect is generally reduced to zero due
to file system caps), and with the advent of ambient caps an actually useful
replacement exists, hence let's get rid of this.

I am pretty sure this was unused and our man page already recommended against
its use, hence this should be a safe thing to remove.
2016-02-13 11:59:34 +01:00
Daniel Mack 9ca6ff50ab Remove kdbus custom endpoint support
This feature will not be used anytime soon, so remove a bit of cruft.

The BusPolicy= config directive will stay around as compat noop.
2016-02-11 22:12:04 +01:00
Daniel Mack b26fa1a2fb tree-wide: remove Emacs lines from all files
This should be handled fine now by .dir-locals.el, so need to carry that
stuff in every file.
2016-02-10 13:41:57 +01:00
Lennart Poettering 1e22b5cda0 core: don't reset /dev/console if stdin/stdout/stderr as passed as fd in a transient service
Otherwise we might end resetting /dev/console all the time when a transient service starts or stops.

Fixes #2377
Fixes #2198
Fixes #2061
2016-01-28 16:25:39 +01:00
Lennart Poettering 7bb70b6e3d core: normalize error handling a bit, in setup_pam()
Assign errno-style errors to a variable called "r" when they happen, the same way we do this in most other calls. It's
bad enough that the error handling part of the function deals with two different error variables (pam_code and r) now,
but before this fix it was even three!
2016-01-25 17:19:19 +01:00
Zbigniew Jędrzejewski-Szmek 2a836ca970 systemd: remove dead code
We only go to fail label if pam_pid <= 0.

CID #1306746.
2016-01-20 18:55:56 -05:00
Zbigniew Jędrzejewski-Szmek b326715278 tree-wide: check if errno is greater than zero (2)
Compare errno with zero in a way that tells gcc that
(if the condition is true) errno is positive.
2016-01-13 15:10:17 -05:00
Zbigniew Jędrzejewski-Szmek f5e5c28f42 tree-wide: check if errno is greater then zero
gcc is confused by the common idiom of
  return errno ? -errno : -ESOMETHING
and thinks a positive value may be returned. Replace this condition
with errno > 0 to help gcc and avoid many spurious warnings. I filed
a gcc rfe a long time ago, but it hard to say if it will ever be
implemented [1].

Both conventions were used in the codebase, this change makes things
more consistent. This is a follow up to bcb161b023.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61846
2016-01-13 15:09:55 -05:00