Commit Graph

200 Commits

Author SHA1 Message Date
Lennart Poettering 4b58153dd2 core: add "invocation ID" concept to service manager
This adds a new invocation ID concept to the service manager. The invocation ID
identifies each runtime cycle of a unit uniquely. A new randomized 128bit ID is
generated each time a unit moves from and inactive to an activating or active
state.

The primary usecase for this concept is to connect the runtime data PID 1
maintains about a service with the offline data the journal stores about it.
Previously we'd use the unit name plus start/stop times, which however is
highly racy since the journal will generally process log data after the service
already ended.

The "invocation ID" kinda matches the "boot ID" concept of the Linux kernel,
except that it applies to an individual unit instead of the whole system.

The invocation ID is passed to the activated processes as environment variable.
It is additionally stored as extended attribute on the cgroup of the unit. The
latter is used by journald to automatically retrieve it for each log logged
message and attach it to the log entry. The environment variable is very easily
accessible, even for unprivileged services. OTOH the extended attribute is only
accessible to privileged processes (this is because cgroupfs only supports the
"trusted." xattr namespace, not "user."). The environment variable may be
altered by services, the extended attribute may not be, hence is the better
choice for the journal.

Note that reading the invocation ID off the extended attribute from journald is
racy, similar to the way reading the unit name for a logging process is.

This patch adds APIs to read the invocation ID to sd-id128:
sd_id128_get_invocation() may be used in a similar fashion to
sd_id128_get_boot().

PID1's own logging is updated to always include the invocation ID when it logs
information about a unit.

A new bus call GetUnitByInvocationID() is added that allows retrieving a bus
path to a unit by its invocation ID. The bus path is built using the invocation
ID, thus providing a path for referring to a unit that is valid only for the
current runtime cycleof it.

Outlook for the future: should the kernel eventually allow passing of cgroup
information along AF_UNIX/SOCK_DGRAM messages via a unique cgroup id, then we
can alter the invocation ID to be generated as hash from that rather than
entirely randomly. This way we can derive the invocation race-freely from the
messages.
2016-10-07 20:14:38 +02:00
hbrueckner 6abfd30372 seccomp: add support for the s390 architecture (#4287)
Add seccomp support for the s390 architecture (31-bit and 64-bit)
to systemd.

This requires libseccomp >= 2.3.1.
2016-10-05 13:58:55 +02:00
Stefan Schweter cfaf4b75e0 man: remove consecutive duplicate words (#4268)
This PR removes consecutive duplicate words from the man pages of:

* `resolved.conf.xml`
* `systemd.exec.xml`
* `systemd.socket.xml`
2016-10-03 17:09:54 +02:00
Djalal Harouni 8f81a5f61b core: Use @raw-io syscall group to filter I/O syscalls when PrivateDevices= is set
Instead of having a local syscall list, use the @raw-io group which
contains the same set of syscalls to filter.
2016-09-25 12:52:27 +02:00
Djalal Harouni 49accde7bd core:sandbox: add more /proc/* entries to ProtectKernelTunables=
Make ALSA entries, latency interface, mtrr, apm/acpi, suspend interface,
filesystems configuration and IRQ tuning readonly.

Most of these interfaces now days should be in /sys but they are still
available through /proc, so just protect them. This patch does not touch
/proc/net/...
2016-09-25 11:30:11 +02:00
Djalal Harouni 9221aec8d0 doc: explicitly document that /dev/mem and /dev/port are blocked by PrivateDevices=true 2016-09-25 11:25:44 +02:00
Djalal Harouni e778185bb5 doc: documentation fixes for ReadWritePaths= and ProtectKernelTunables=
Documentation fixes for ReadWritePaths= and ProtectKernelTunables=
as reported by Evgeny Vereshchagin.
2016-09-25 11:25:31 +02:00
Lennart Poettering 6757c06a1a man: shorten the exit status table a bit
Let's merge a couple of columns, to make the table a bit shorter. This
effectively just drops whitespace, not contents, but makes the currently
humungous table much much more compact.
2016-09-25 10:52:57 +02:00
Lennart Poettering 81c8aceed4 man: the exit code/signal is stored in $EXIT_CODE, not $EXIT_STATUS 2016-09-25 10:52:57 +02:00
Lennart Poettering effbd6d2ea man: rework documentation for ReadOnlyPaths= and related settings
This reworks the documentation for ReadOnlyPaths=, ReadWritePaths=,
InaccessiblePaths=. It no longer claims that we'd follow symlinks relative to
the host file system. (Which wasn't true actually, as we didn't follow symlinks
at all in the most recent releases, and we know do follow them, but relative to
RootDirectory=).

This also replaces all references to the fact that all fs namespacing options
can be undone with enough privileges and disable propagation by a single one in
the documentation of ReadOnlyPaths= and friends, and then directs the read to
this in all other places.

Moreover a hint is added to the documentation of SystemCallFilter=, suggesting
usage of ~@mount in case any of the fs namespacing related options are used.
2016-09-25 10:42:18 +02:00
Lennart Poettering b2656f1b1c man: in user-facing documentaiton don't reference C function names
Let's drop the reference to the cap_from_name() function in the documentation
for the capabilities setting, as it is hardly helpful. Our readers are not
necessarily C hackers knowing the semantics of cap_from_name(). Moreover, the
strings we accept are just the plain capability names as listed in
capabilities(7) hence there's really no point in confusing the user with
anything else.
2016-09-25 10:42:18 +02:00
Lennart Poettering 63bb64a056 core: imply ProtectHome=read-only and ProtectSystem=strict if DynamicUser=1
Let's make sure that services that use DynamicUser=1 cannot leave files in the
file system should the system accidentally have a world-writable directory
somewhere.

This effectively ensures that directories need to be whitelisted rather than
blacklisted for access when DynamicUser=1 is set.
2016-09-25 10:42:18 +02:00
Lennart Poettering 3f815163ff core: introduce ProtectSystem=strict
Let's tighten our sandbox a bit more: with this change ProtectSystem= gains a
new setting "strict". If set, the entire directory tree of the system is
mounted read-only, but the API file systems /proc, /dev, /sys are excluded
(they may be managed with PrivateDevices= and ProtectKernelTunables=). Also,
/home and /root are excluded as those are left for ProtectHome= to manage.

In this mode, all "real" file systems (i.e. non-API file systems) are mounted
read-only, and specific directories may only be excluded via
ReadWriteDirectories=, thus implementing an effective whitelist instead of
blacklist of writable directories.

While we are at, also add /efi to the list of paths always affected by
ProtectSystem=. This is a follow-up for
b52a109ad3 which added /efi as alternative for
/boot. Our namespacing logic should respect that too.
2016-09-25 10:42:18 +02:00
Lennart Poettering 59eeb84ba6 core: add two new service settings ProtectKernelTunables= and ProtectControlGroups=
If enabled, these will block write access to /sys, /proc/sys and
/proc/sys/fs/cgroup.
2016-09-25 10:18:48 +02:00
Lennart Poettering 00d9ef8560 core: add RemoveIPC= setting
This adds the boolean RemoveIPC= setting to service, socket, mount and swap
units (i.e.  all unit types that may invoke processes). if turned on, and the
unit's user/group is not root, all IPC objects of the user/group are removed
when the service is shut down. The life-cycle of the IPC objects is hence bound
to the unit life-cycle.

This is particularly relevant for units with dynamic users, as it is essential
that no objects owned by the dynamic users survive the service exiting. In
fact, this patch adds code to imply RemoveIPC= if DynamicUser= is set.

In order to communicate the UID/GID of an executed process back to PID 1 this
adds a new "user lookup" socket pair, that is inherited into the forked
processes, and closed before the exec(). This is needed since we cannot do NSS
from PID 1 due to deadlock risks, However need to know the used UID/GID in
order to clean up IPC owned by it if the unit shuts down.
2016-08-19 00:37:25 +02:00
Zbigniew Jędrzejewski-Szmek 29df65f913 man: add "timeout" to status table (#3919) 2016-08-11 10:51:49 +02:00
Lennart Poettering 56bf97e10f Merge pull request #3914 from keszybz/fix-man-links
Fix man links
2016-08-07 11:17:56 +02:00
Zbigniew Jędrzejewski-Szmek e64e1bfd86 man: add a table of possible exit statuses (#3910) 2016-08-07 11:14:40 +02:00
Zbigniew Jędrzejewski-Szmek d87a2ef782 Merge pull request #3884 from poettering/private-users 2016-08-06 17:04:45 -04:00
Zbigniew Jędrzejewski-Szmek 0a07667d8d man: provide html links to a bunch of external man pages 2016-08-06 16:39:53 -04:00
Lennart Poettering 136dc4c435 core: set $SERVICE_RESULT, $EXIT_CODE and $EXIT_STATUS in ExecStop=/ExecStopPost= commands
This should simplify monitoring tools for services, by passing the most basic
information about service result/exit information via environment variables,
thus making it unnecessary to retrieve them explicitly via the bus.
2016-08-04 23:08:05 +02:00
Lennart Poettering d251207d55 core: add new PrivateUsers= option to service execution
This setting adds minimal user namespacing support to a service. When set the invoked
processes will run in their own user namespace. Only a trivial mapping will be
set up: the root user/group is mapped to root, and the user/group of the
service will be mapped to itself, everything else is mapped to nobody.

If this setting is used the service runs with no capabilities on the host, but
configurable capabilities within the service.

This setting is particularly useful in conjunction with RootDirectory= as the
need to synchronize /etc/passwd and /etc/group between the host and the service
OS tree is reduced, as only three UID/GIDs need to match: root, nobody and the
user of the service itself. But even outside the RootDirectory= case this
setting is useful to substantially reduce the attack surface of a service.

Example command to test this:

        systemd-run -p PrivateUsers=1 -p User=foobar -t /bin/sh

This runs a shell as user "foobar". When typing "ps" only processes owned by
"root", by "foobar", and by "nobody" should be visible.
2016-08-03 20:42:04 +02:00
Zbigniew Jędrzejewski-Szmek dadd6ecfa5 Merge pull request #3728 from poettering/dynamic-users 2016-07-25 16:40:26 -04:00
Lennart Poettering 43eb109aa9 core: change ExecStart=! syntax to ExecStart=+ (#3797)
As suggested by @mbiebl we already use the "!" special char in unit file
assignments for negation, hence we should not use it in a different context for
privileged execution. Let's use "+" instead.
2016-07-25 16:53:33 +02:00
Lennart Poettering 29206d4619 core: add a concept of "dynamic" user ids, that are allocated as long as a service is running
This adds a new boolean setting DynamicUser= to service files. If set, a new
user will be allocated dynamically when the unit is started, and released when
it is stopped. The user ID is allocated from the range 61184..65519. The user
will not be added to /etc/passwd (but an NSS module to be added later should
make it show up in getent passwd).

For now, care should be taken that the service writes no files to disk, since
this might result in files owned by UIDs that might get assigned dynamically to
a different service later on. Later patches will tighten sandboxing in order to
ensure that this cannot happen, except for a few selected directories.

A simple way to test this is:

        systemd-run -p DynamicUser=1 /bin/sleep 99999
2016-07-22 15:53:45 +02:00
Alessandro Puccetti 2a624c36e6 doc,core: Read{Write,Only}Paths= and InaccessiblePaths=
This patch renames Read{Write,Only}Directories= and InaccessibleDirectories=
to Read{Write,Only}Paths= and InaccessiblePaths=, previous names are kept
as aliases but they are not advertised in the documentation.

Renamed variables:
`read_write_dirs` --> `read_write_paths`
`read_only_dirs` --> `read_only_paths`
`inaccessible_dirs` --> `inaccessible_paths`
2016-07-19 17:22:02 +02:00
Alessandro Puccetti c4b4170746 namespace: unify limit behavior on non-directory paths
Despite the name, `Read{Write,Only}Directories=` already allows for
regular file paths to be masked. This commit adds the same behavior
to `InaccessibleDirectories=` and makes it explicit in the doc.
This patch introduces `/run/systemd/inaccessible/{reg,dir,chr,blk,fifo,sock}`
{dile,device}nodes and mounts on the appropriate one the paths specified
in `InacessibleDirectories=`.

Based on Luca's patch from https://github.com/systemd/systemd/pull/3327
2016-07-19 17:22:02 +02:00
Lennart Poettering f4170c671b execute: add a new easy-to-use RestrictRealtime= option to units
It takes a boolean value. If true, access to SCHED_RR, SCHED_FIFO and
SCHED_DEADLINE is blocked, which my be used to lock up the system.
2016-06-23 01:45:45 +02:00
Lennart Poettering 7bce046bcf core: set $JOURNAL_STREAM to the dev_t/ino_t of the journal stream of executed services
This permits services to detect whether their stdout/stderr is connected to the
journal, and if so talk to the journal directly, thus permitting carrying of
metadata.

As requested by the gtk folks: #2473
2016-06-15 23:00:27 +02:00
Lennart Poettering 1f9ac68b5b core: improve seccomp syscall grouping a bit
This adds three new seccomp syscall groups: @keyring for kernel keyring access,
@cpu-emulation for CPU emulation features, for exampe vm86() for dosemu and
suchlike, and @debug for ptrace() and related calls.

Also, the @clock group is updated with more syscalls that alter the system
clock. capset() is added to @privileged, and pciconfig_iobase() is added to
@raw-io.

Finally, @obsolete is a cleaned up. A number of syscalls that never existed on
Linux and have no number assigned on any architecture are removed, as they only
exist in the man pages and other operating sytems, but not in code at all.
create_module() is moved from @module to @obsolete, as it is an obsolete system
call. mem_getpolicy() is removed from the @obsolete list, as it is not
obsolete, but simply a NUMA API.
2016-06-13 16:25:54 +02:00
Alessandro Puccetti cf677fe686 core/execute: add the magic character '!' to allow privileged execution (#3493)
This patch implements the new magic character '!'. By putting '!' in front
of a command, systemd executes it with full privileges ignoring paramters
such as User, Group, SupplementaryGroups, CapabilityBoundingSet,
AmbientCapabilities, SecureBits, SystemCallFilter, SELinuxContext,
AppArmorProfile, SmackProcessLabel, and RestrictAddressFamilies.

Fixes partially https://github.com/systemd/systemd/issues/3414
Related to https://github.com/coreos/rkt/issues/2482

Testing:
1. Create a user 'bob'
2. Create the unit file /etc/systemd/system/exec-perm.service
   (You can use the example below)
3. sudo systemctl start ext-perm.service
4. Verify that the commands starting with '!' were not executed as bob,
   4.1 Looking to the output of ls -l /tmp/exec-perm
   4.2 Each file contains the result of the id command.

`````````````````````````````````````````````````````````````````
[Unit]
Description=ext-perm

[Service]
Type=oneshot
TimeoutStartSec=0
User=bob
ExecStartPre=!/usr/bin/sh -c "/usr/bin/rm /tmp/exec-perm*" ;
    /usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-start-pre"
ExecStart=/usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-start" ;
    !/usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-star-2"
ExecStartPost=/usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-start-post"
ExecReload=/usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-reload"
ExecStop=!/usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-stop"
ExecStopPost=/usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-stop-post"

[Install]
WantedBy=multi-user.target]
`````````````````````````````````````````````````````````````````
2016-06-10 18:19:54 +02:00
Topi Miettinen f3e4363593 core: Restrict mmap and mprotect with PAGE_WRITE|PAGE_EXEC (#3319) (#3379)
New exec boolean MemoryDenyWriteExecute, when set, installs
a seccomp filter to reject mmap(2) with PAGE_WRITE|PAGE_EXEC
and mprotect(2) with PAGE_EXEC.
2016-06-03 17:58:18 +02:00
Topi Miettinen 201c1cc22a core: add pre-defined syscall groups to SystemCallFilter= (#3053) (#3157)
Implement sets of system calls to help constructing system call
filters. A set starts with '@' to distinguish from a system call.

Closes: #3053, #3157
2016-06-01 11:56:01 +02:00
Alessandro Puccetti 043cc71512 doc: clarify systemd.exec's paths definition (#3368)
Definitions of ReadWriteDirectories=, ReadOnlyDirectories=, InaccessibleDirectories=,
WorkingDirectory=, and RootDirecory= were not clear. This patch specifies when
they are relative to the host's root directory and when they are relative to the service's
root directory.

Fixes #3248
2016-05-30 16:37:07 +02:00
Luca Bruno 008dce3875
man: fix recurring typo 2016-05-30 13:43:53 +02:00
topimiettinen 737ba3c82c namespace: Make private /dev noexec and readonly (#3263)
Private /dev will not be managed by udev or others, so we can make it
noexec and readonly after we have made all device nodes. As /dev/shm
needs to be writable, we can't use bind_remount_recursive().
2016-05-15 22:34:05 -04:00
Lennart Poettering 2985700185 core: make parsing of RLIMIT_NICE aware of actual nice levels 2016-04-29 16:27:49 +02:00
Lennart Poettering dfe85b38d2 man: minor wording fixes
As suggested in:

https://github.com/systemd/systemd/pull/3124#discussion_r61068789
2016-04-29 12:23:34 +02:00
Lennart Poettering 28c75e2501 man: elaborate on the automatic systemd-journald.socket service dependencies
Fixes: #1603
2016-04-26 12:00:49 +02:00
Nicolas Braud-Santoni b50a16af8e man: systemd.exec: Clarify InaccessibleDirectories (#3048) (#3048) 2016-04-17 14:22:17 +02:00
Ronny Chevalier 19c0b0b9a5 core: set NoNewPrivileges for seccomp if we don't have CAP_SYS_ADMIN
The manpage of seccomp specify that using seccomp with
SECCOMP_SET_MODE_FILTER will return EACCES if the caller do not have
CAP_SYS_ADMIN set, or if the no_new_privileges bit is not set. Hence,
without NoNewPrivilege set, it is impossible to use a SystemCall*
directive with a User directive set in system mode.

Now, NoNewPrivileges is set if we are in user mode, or if we are in
system mode and we don't have CAP_SYS_ADMIN, and SystemCall*
directives are used.
2016-02-28 14:44:26 +01:00
Lennart Poettering 7882632d5a man: extend the Personality= documentation
Among other fixes, add information about more architectures that are supported
these days.
2016-02-22 23:23:06 +01:00
Lennart Poettering 479050b363 core: drop Capabilities= setting
The setting is hardly useful (since its effect is generally reduced to zero due
to file system caps), and with the advent of ambient caps an actually useful
replacement exists, hence let's get rid of this.

I am pretty sure this was unused and our man page already recommended against
its use, hence this should be a safe thing to remove.
2016-02-13 11:59:34 +01:00
Ismo Puustinen ece87975a9 man: add AmbientCapabilities entry. 2016-01-12 12:14:50 +02:00
Karel Zak 91518d20dd core: support <soft:hard> ranges for RLIMIT options
The new parser supports:

 <value>       - specify both limits to the same value
 <soft:hard>   - specify both limits

the size or time specific suffixes are supported, for example

  LimitRTTIME=1sec
  LimitAS=4G:16G

The patch introduces parse_rlimit_range() and rlim type (size, sec,
usec, etc.) specific parsers. No code is duplicated now.

The patch also sync docs for DefaultLimitXXX= and LimitXXX=.

References: https://github.com/systemd/systemd/issues/1769
2015-11-25 12:03:32 +01:00
Evgeny Vereshchagin 5c019cf260 man: systemd.exec: add missing variables 2015-11-19 13:37:16 +00:00
Tom Gundersen fb5c8184a9 Merge pull request #1854 from poettering/unit-deps
Dependency engine improvements
2015-11-11 23:14:12 +01:00
Lennart Poettering c129bd5df3 man: document automatic dependencies
For all units ensure there's an "Automatic Dependencies" section in the
man page, and explain which dependencies are automatically added in all
cases, and which ones are added on top if DefaultDependencies=yes is
set.

This is also done for systemd.exec(5), systemd.resource-control(5) and
systemd.unit(5) as these pages describe common behaviour of various unit
types.
2015-11-11 20:47:07 +01:00
Filipe Brandenburger b4c14404b3 execute: Add new PassEnvironment= directive
This directive allows passing environment variables from the system
manager to spawned services. Variables in the system manager can be set
inside a container by passing `--set-env=...` options to systemd-spawn.

Tested with an on-disk test.service unit. Tested using multiple variable
names on a single line, with an empty setting to clear the current list
of variables, with non-existing variables.

Tested using `systemd-run -p PassEnvironment=VARNAME` to confirm it
works with transient units.

Confirmed that `systemctl show` will display the PassEnvironment
settings.

Checked that man pages are generated correctly.

No regressions in `make check`.
2015-11-11 07:55:23 -08:00
Lennart Poettering a4c1800284 core: accept time units for time-based resource limits
Let's make sure "LimitCPU=30min" can be parsed properly, following the
usual logic how we parse time values. Similar for LimitRTTIME=.

While we are at it, extend a bit on the man page section about resource
limits.

Fixes: #1772
2015-11-10 17:36:46 +01:00