Systemd

Commit Graph

Author	SHA1	Message	Date
Luca Bruno	52c239d770	core/exec: add a named-descriptor option ("fd") for streams (#4179 ) This commit adds a `fd` option to `StandardInput=`, `StandardOutput=` and `StandardError=` properties in order to connect standard streams to externally named descriptors provided by some socket units. This option looks for a file descriptor named as the corresponding stream. Custom names can be specified, separated by a colon. If multiple name-matches exist, the first matching fd will be used.	2016-10-17 20:05:49 -04:00
Lennart Poettering	c7458f9399	man: avoid abbreviated "cgroups" terminology (#4396 ) Let's avoid the overly abbreviated "cgroups" terminology. Let's instead write: "Linux Control Groups (cgroups)" is the long form wherever the term is introduced in prose. Use "control groups" in the short form wherever the term is used within brief explanations. Follow-up to: #4381	2016-10-17 09:50:26 -04:00
Zbigniew Jędrzejewski-Szmek	74b47bbd5d	man: add crosslink between systemd.resource-control(5) and systemd.exec(5) Fixes #4379.	2016-10-15 18:38:20 -04:00
Lennart Poettering	8bfdf29b24	Merge pull request #4243 from endocode/djalal/sandbox-first-protection-kernelmodules-v1 core:sandbox: Add ProtectKernelModules= and some fixes	2016-10-13 18:36:29 +02:00
Thomas Hindoe Paaboel Andersen	2dd678171e	man: typo fixes A mix of fixes for typos and UK english	2016-10-12 23:02:44 +02:00
Djalal Harouni	c575770b75	core:sandbox: lets make /lib/modules/ inaccessible on ProtectKernelModules= Lets go further and make /lib/modules/ inaccessible for services that do not have business with modules, this is a minor improvment but it may help on setups with custom modules and they are limited... in regard of kernel auto-load feature. This change introduce NameSpaceInfo struct which we may embed later inside ExecContext but for now lets just reduce the argument number to setup_namespace() and merge ProtectKernelModules feature.	2016-10-12 14:11:16 +02:00
Djalal Harouni	ac246d9868	doc: minor hint about InaccessiblePaths= in regard of ProtectKernelTunables=	2016-10-12 13:52:40 +02:00
Djalal Harouni	2cd0a73547	core:sandbox: remove CAP_SYS_RAWIO on PrivateDevices=yes The rawio system calls were filtered, but CAP_SYS_RAWIO allows to access raw data through /proc, ioctl and some other exotic system calls...	2016-10-12 13:39:49 +02:00
Djalal Harouni	502d704e5e	core:sandbox: Add ProtectKernelModules= option This is useful to turn off explicit module load and unload operations on modular kernels. This option removes CAP_SYS_MODULE from the capability bounding set for the unit, and installs a system call filter to block module system calls. This option will not prevent the kernel from loading modules using the module auto-load feature which is a system wide operation.	2016-10-12 13:31:21 +02:00
Zbigniew Jędrzejewski-Szmek	56b4c80b42	Merge pull request #4348 from poettering/docfixes Various smaller documentation fixes.	2016-10-11 13:49:15 -04:00
Lennart Poettering	f4c9356d13	man: beef up documentation on per-unit resource limits a bit Let's clarify that for user services some OS-defined limits bound the settings in the unit files. Fixes: #4232	2016-10-11 18:42:22 +02:00
Lennart Poettering	4b58153dd2	core: add "invocation ID" concept to service manager This adds a new invocation ID concept to the service manager. The invocation ID identifies each runtime cycle of a unit uniquely. A new randomized 128bit ID is generated each time a unit moves from and inactive to an activating or active state. The primary usecase for this concept is to connect the runtime data PID 1 maintains about a service with the offline data the journal stores about it. Previously we'd use the unit name plus start/stop times, which however is highly racy since the journal will generally process log data after the service already ended. The "invocation ID" kinda matches the "boot ID" concept of the Linux kernel, except that it applies to an individual unit instead of the whole system. The invocation ID is passed to the activated processes as environment variable. It is additionally stored as extended attribute on the cgroup of the unit. The latter is used by journald to automatically retrieve it for each log logged message and attach it to the log entry. The environment variable is very easily accessible, even for unprivileged services. OTOH the extended attribute is only accessible to privileged processes (this is because cgroupfs only supports the "trusted." xattr namespace, not "user."). The environment variable may be altered by services, the extended attribute may not be, hence is the better choice for the journal. Note that reading the invocation ID off the extended attribute from journald is racy, similar to the way reading the unit name for a logging process is. This patch adds APIs to read the invocation ID to sd-id128: sd_id128_get_invocation() may be used in a similar fashion to sd_id128_get_boot(). PID1's own logging is updated to always include the invocation ID when it logs information about a unit. A new bus call GetUnitByInvocationID() is added that allows retrieving a bus path to a unit by its invocation ID. The bus path is built using the invocation ID, thus providing a path for referring to a unit that is valid only for the current runtime cycleof it. Outlook for the future: should the kernel eventually allow passing of cgroup information along AF_UNIX/SOCK_DGRAM messages via a unique cgroup id, then we can alter the invocation ID to be generated as hash from that rather than entirely randomly. This way we can derive the invocation race-freely from the messages.	2016-10-07 20:14:38 +02:00
hbrueckner	6abfd30372	seccomp: add support for the s390 architecture (#4287 ) Add seccomp support for the s390 architecture (31-bit and 64-bit) to systemd. This requires libseccomp >= 2.3.1.	2016-10-05 13:58:55 +02:00
Stefan Schweter	cfaf4b75e0	man: remove consecutive duplicate words (#4268 ) This PR removes consecutive duplicate words from the man pages of: * `resolved.conf.xml` * `systemd.exec.xml` * `systemd.socket.xml`	2016-10-03 17:09:54 +02:00
Djalal Harouni	8f81a5f61b	core: Use @raw-io syscall group to filter I/O syscalls when PrivateDevices= is set Instead of having a local syscall list, use the @raw-io group which contains the same set of syscalls to filter.	2016-09-25 12:52:27 +02:00
Djalal Harouni	49accde7bd	core:sandbox: add more /proc/* entries to ProtectKernelTunables= Make ALSA entries, latency interface, mtrr, apm/acpi, suspend interface, filesystems configuration and IRQ tuning readonly. Most of these interfaces now days should be in /sys but they are still available through /proc, so just protect them. This patch does not touch /proc/net/...	2016-09-25 11:30:11 +02:00
Djalal Harouni	9221aec8d0	doc: explicitly document that /dev/mem and /dev/port are blocked by PrivateDevices=true	2016-09-25 11:25:44 +02:00
Djalal Harouni	e778185bb5	doc: documentation fixes for ReadWritePaths= and ProtectKernelTunables= Documentation fixes for ReadWritePaths= and ProtectKernelTunables= as reported by Evgeny Vereshchagin.	2016-09-25 11:25:31 +02:00
Lennart Poettering	6757c06a1a	man: shorten the exit status table a bit Let's merge a couple of columns, to make the table a bit shorter. This effectively just drops whitespace, not contents, but makes the currently humungous table much much more compact.	2016-09-25 10:52:57 +02:00
Lennart Poettering	81c8aceed4	man: the exit code/signal is stored in $EXIT_CODE, not $EXIT_STATUS	2016-09-25 10:52:57 +02:00
Lennart Poettering	effbd6d2ea	man: rework documentation for ReadOnlyPaths= and related settings This reworks the documentation for ReadOnlyPaths=, ReadWritePaths=, InaccessiblePaths=. It no longer claims that we'd follow symlinks relative to the host file system. (Which wasn't true actually, as we didn't follow symlinks at all in the most recent releases, and we know do follow them, but relative to RootDirectory=). This also replaces all references to the fact that all fs namespacing options can be undone with enough privileges and disable propagation by a single one in the documentation of ReadOnlyPaths= and friends, and then directs the read to this in all other places. Moreover a hint is added to the documentation of SystemCallFilter=, suggesting usage of ~@mount in case any of the fs namespacing related options are used.	2016-09-25 10:42:18 +02:00
Lennart Poettering	b2656f1b1c	man: in user-facing documentaiton don't reference C function names Let's drop the reference to the cap_from_name() function in the documentation for the capabilities setting, as it is hardly helpful. Our readers are not necessarily C hackers knowing the semantics of cap_from_name(). Moreover, the strings we accept are just the plain capability names as listed in capabilities(7) hence there's really no point in confusing the user with anything else.	2016-09-25 10:42:18 +02:00
Lennart Poettering	63bb64a056	core: imply ProtectHome=read-only and ProtectSystem=strict if DynamicUser=1 Let's make sure that services that use DynamicUser=1 cannot leave files in the file system should the system accidentally have a world-writable directory somewhere. This effectively ensures that directories need to be whitelisted rather than blacklisted for access when DynamicUser=1 is set.	2016-09-25 10:42:18 +02:00
Lennart Poettering	3f815163ff	core: introduce ProtectSystem=strict Let's tighten our sandbox a bit more: with this change ProtectSystem= gains a new setting "strict". If set, the entire directory tree of the system is mounted read-only, but the API file systems /proc, /dev, /sys are excluded (they may be managed with PrivateDevices= and ProtectKernelTunables=). Also, /home and /root are excluded as those are left for ProtectHome= to manage. In this mode, all "real" file systems (i.e. non-API file systems) are mounted read-only, and specific directories may only be excluded via ReadWriteDirectories=, thus implementing an effective whitelist instead of blacklist of writable directories. While we are at, also add /efi to the list of paths always affected by ProtectSystem=. This is a follow-up for `b52a109ad3` which added /efi as alternative for /boot. Our namespacing logic should respect that too.	2016-09-25 10:42:18 +02:00
Lennart Poettering	59eeb84ba6	core: add two new service settings ProtectKernelTunables= and ProtectControlGroups= If enabled, these will block write access to /sys, /proc/sys and /proc/sys/fs/cgroup.	2016-09-25 10:18:48 +02:00
Lennart Poettering	00d9ef8560	core: add RemoveIPC= setting This adds the boolean RemoveIPC= setting to service, socket, mount and swap units (i.e. all unit types that may invoke processes). if turned on, and the unit's user/group is not root, all IPC objects of the user/group are removed when the service is shut down. The life-cycle of the IPC objects is hence bound to the unit life-cycle. This is particularly relevant for units with dynamic users, as it is essential that no objects owned by the dynamic users survive the service exiting. In fact, this patch adds code to imply RemoveIPC= if DynamicUser= is set. In order to communicate the UID/GID of an executed process back to PID 1 this adds a new "user lookup" socket pair, that is inherited into the forked processes, and closed before the exec(). This is needed since we cannot do NSS from PID 1 due to deadlock risks, However need to know the used UID/GID in order to clean up IPC owned by it if the unit shuts down.	2016-08-19 00:37:25 +02:00
Zbigniew Jędrzejewski-Szmek	29df65f913	man: add "timeout" to status table (#3919 )	2016-08-11 10:51:49 +02:00
Lennart Poettering	56bf97e10f	Merge pull request #3914 from keszybz/fix-man-links Fix man links	2016-08-07 11:17:56 +02:00
Zbigniew Jędrzejewski-Szmek	e64e1bfd86	man: add a table of possible exit statuses (#3910 )	2016-08-07 11:14:40 +02:00
Zbigniew Jędrzejewski-Szmek	d87a2ef782	Merge pull request #3884 from poettering/private-users	2016-08-06 17:04:45 -04:00
Zbigniew Jędrzejewski-Szmek	0a07667d8d	man: provide html links to a bunch of external man pages	2016-08-06 16:39:53 -04:00
Lennart Poettering	136dc4c435	core: set $SERVICE_RESULT, $EXIT_CODE and $EXIT_STATUS in ExecStop=/ExecStopPost= commands This should simplify monitoring tools for services, by passing the most basic information about service result/exit information via environment variables, thus making it unnecessary to retrieve them explicitly via the bus.	2016-08-04 23:08:05 +02:00
Lennart Poettering	d251207d55	core: add new PrivateUsers= option to service execution This setting adds minimal user namespacing support to a service. When set the invoked processes will run in their own user namespace. Only a trivial mapping will be set up: the root user/group is mapped to root, and the user/group of the service will be mapped to itself, everything else is mapped to nobody. If this setting is used the service runs with no capabilities on the host, but configurable capabilities within the service. This setting is particularly useful in conjunction with RootDirectory= as the need to synchronize /etc/passwd and /etc/group between the host and the service OS tree is reduced, as only three UID/GIDs need to match: root, nobody and the user of the service itself. But even outside the RootDirectory= case this setting is useful to substantially reduce the attack surface of a service. Example command to test this: systemd-run -p PrivateUsers=1 -p User=foobar -t /bin/sh This runs a shell as user "foobar". When typing "ps" only processes owned by "root", by "foobar", and by "nobody" should be visible.	2016-08-03 20:42:04 +02:00
Zbigniew Jędrzejewski-Szmek	dadd6ecfa5	Merge pull request #3728 from poettering/dynamic-users	2016-07-25 16:40:26 -04:00
Lennart Poettering	43eb109aa9	core: change ExecStart=! syntax to ExecStart=+ (#3797 ) As suggested by @mbiebl we already use the "!" special char in unit file assignments for negation, hence we should not use it in a different context for privileged execution. Let's use "+" instead.	2016-07-25 16:53:33 +02:00
Lennart Poettering	29206d4619	core: add a concept of "dynamic" user ids, that are allocated as long as a service is running This adds a new boolean setting DynamicUser= to service files. If set, a new user will be allocated dynamically when the unit is started, and released when it is stopped. The user ID is allocated from the range 61184..65519. The user will not be added to /etc/passwd (but an NSS module to be added later should make it show up in getent passwd). For now, care should be taken that the service writes no files to disk, since this might result in files owned by UIDs that might get assigned dynamically to a different service later on. Later patches will tighten sandboxing in order to ensure that this cannot happen, except for a few selected directories. A simple way to test this is: systemd-run -p DynamicUser=1 /bin/sleep 99999	2016-07-22 15:53:45 +02:00
Alessandro Puccetti	2a624c36e6	doc,core: Read{Write,Only}Paths= and InaccessiblePaths= This patch renames Read{Write,Only}Directories= and InaccessibleDirectories= to Read{Write,Only}Paths= and InaccessiblePaths=, previous names are kept as aliases but they are not advertised in the documentation. Renamed variables: `read_write_dirs` --> `read_write_paths` `read_only_dirs` --> `read_only_paths` `inaccessible_dirs` --> `inaccessible_paths`	2016-07-19 17:22:02 +02:00
Alessandro Puccetti	c4b4170746	namespace: unify limit behavior on non-directory paths Despite the name, `Read{Write,Only}Directories=` already allows for regular file paths to be masked. This commit adds the same behavior to `InaccessibleDirectories=` and makes it explicit in the doc. This patch introduces `/run/systemd/inaccessible/{reg,dir,chr,blk,fifo,sock}` {dile,device}nodes and mounts on the appropriate one the paths specified in `InacessibleDirectories=`. Based on Luca's patch from https://github.com/systemd/systemd/pull/3327	2016-07-19 17:22:02 +02:00
Lennart Poettering	f4170c671b	execute: add a new easy-to-use RestrictRealtime= option to units It takes a boolean value. If true, access to SCHED_RR, SCHED_FIFO and SCHED_DEADLINE is blocked, which my be used to lock up the system.	2016-06-23 01:45:45 +02:00
Lennart Poettering	7bce046bcf	core: set $JOURNAL_STREAM to the dev_t/ino_t of the journal stream of executed services This permits services to detect whether their stdout/stderr is connected to the journal, and if so talk to the journal directly, thus permitting carrying of metadata. As requested by the gtk folks: #2473	2016-06-15 23:00:27 +02:00
Lennart Poettering	1f9ac68b5b	core: improve seccomp syscall grouping a bit This adds three new seccomp syscall groups: @keyring for kernel keyring access, @cpu-emulation for CPU emulation features, for exampe vm86() for dosemu and suchlike, and @debug for ptrace() and related calls. Also, the @clock group is updated with more syscalls that alter the system clock. capset() is added to @privileged, and pciconfig_iobase() is added to @raw-io. Finally, @obsolete is a cleaned up. A number of syscalls that never existed on Linux and have no number assigned on any architecture are removed, as they only exist in the man pages and other operating sytems, but not in code at all. create_module() is moved from @module to @obsolete, as it is an obsolete system call. mem_getpolicy() is removed from the @obsolete list, as it is not obsolete, but simply a NUMA API.	2016-06-13 16:25:54 +02:00
Alessandro Puccetti	cf677fe686	core/execute: add the magic character '!' to allow privileged execution (#3493 ) This patch implements the new magic character '!'. By putting '!' in front of a command, systemd executes it with full privileges ignoring paramters such as User, Group, SupplementaryGroups, CapabilityBoundingSet, AmbientCapabilities, SecureBits, SystemCallFilter, SELinuxContext, AppArmorProfile, SmackProcessLabel, and RestrictAddressFamilies. Fixes partially https://github.com/systemd/systemd/issues/3414 Related to https://github.com/coreos/rkt/issues/2482 Testing: 1. Create a user 'bob' 2. Create the unit file /etc/systemd/system/exec-perm.service (You can use the example below) 3. sudo systemctl start ext-perm.service 4. Verify that the commands starting with '!' were not executed as bob, 4.1 Looking to the output of ls -l /tmp/exec-perm 4.2 Each file contains the result of the id command. ````````````````````````````````````````````````````````````````` [Unit] Description=ext-perm [Service] Type=oneshot TimeoutStartSec=0 User=bob ExecStartPre=!/usr/bin/sh -c "/usr/bin/rm /tmp/exec-perm*" ; /usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-start-pre" ExecStart=/usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-start" ; !/usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-star-2" ExecStartPost=/usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-start-post" ExecReload=/usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-reload" ExecStop=!/usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-stop" ExecStopPost=/usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-stop-post" [Install] WantedBy=multi-user.target] `````````````````````````````````````````````````````````````````	2016-06-10 18:19:54 +02:00
Topi Miettinen	f3e4363593	core: Restrict mmap and mprotect with PAGE_WRITE\|PAGE_EXEC (#3319 ) (#3379 ) New exec boolean MemoryDenyWriteExecute, when set, installs a seccomp filter to reject mmap(2) with PAGE_WRITE\|PAGE_EXEC and mprotect(2) with PAGE_EXEC.	2016-06-03 17:58:18 +02:00
Topi Miettinen	201c1cc22a	core: add pre-defined syscall groups to SystemCallFilter= (#3053 ) (#3157 ) Implement sets of system calls to help constructing system call filters. A set starts with '@' to distinguish from a system call. Closes: #3053, #3157	2016-06-01 11:56:01 +02:00
Alessandro Puccetti	043cc71512	doc: clarify systemd.exec's paths definition (#3368 ) Definitions of ReadWriteDirectories=, ReadOnlyDirectories=, InaccessibleDirectories=, WorkingDirectory=, and RootDirecory= were not clear. This patch specifies when they are relative to the host's root directory and when they are relative to the service's root directory. Fixes #3248	2016-05-30 16:37:07 +02:00
Luca Bruno	008dce3875	man: fix recurring typo	2016-05-30 13:43:53 +02:00
topimiettinen	737ba3c82c	namespace: Make private /dev noexec and readonly (#3263 ) Private /dev will not be managed by udev or others, so we can make it noexec and readonly after we have made all device nodes. As /dev/shm needs to be writable, we can't use bind_remount_recursive().	2016-05-15 22:34:05 -04:00
Lennart Poettering	2985700185	core: make parsing of RLIMIT_NICE aware of actual nice levels	2016-04-29 16:27:49 +02:00
Lennart Poettering	dfe85b38d2	man: minor wording fixes As suggested in: https://github.com/systemd/systemd/pull/3124#discussion_r61068789	2016-04-29 12:23:34 +02:00
Lennart Poettering	28c75e2501	man: elaborate on the automatic systemd-journald.socket service dependencies Fixes: #1603	2016-04-26 12:00:49 +02:00

1 2 3 4 5

211 Commits