Systemd

Author	SHA1	Message	Date
WaLyong Cho	80c21aea11	core: dump also missed security context	2017-07-13 13:12:24 +09:00
WaLyong Cho	5b8e1b7755	core: modify resource leak by SmackProcessLabel=	2017-07-13 13:12:15 +09:00
Lennart Poettering	782c925f7f	Revert "core: link user keyring to session keyring (#6275)" (#6342) This reverts commit `437a85112e`. The outcome of this isn't that clear, let's revert this for now, see discussion on #6286.	2017-07-12 10:00:43 -04:00
Christian Hesse	437a85112e	core: link user keyring to session keyring (#6275) Commit `74dd6b515f` (core: run each system service with a fresh session keyring) broke adding keys to user keyring. Added keys could not be accessed with error message: keyctl_read_alloc: Permission denied. So link the user keyring to our session keyring.	2017-07-04 09:38:31 +02:00
Lennart Poettering	7f452159b8	core: make IOSchedulingClass= and IOSchedulingPriority= settable for transient units This patch is a bit more complex than I hoped. In particular the single IOScheduling= property exposed on the bus is split up into IOSchedulingClass= and IOSchedulingPriority= (though compat is retained). Otherwise the asymmetry between setting props and getting them is a bit too nasty. Fixes #5613	2017-06-26 17:43:18 +02:00
Zbigniew Jędrzejewski-Szmek	7e867138f5	Merge pull request #5600 from fbuihuu/make-logind-restartable Make logind restartable.	2017-06-24 18:58:36 -04:00
Franck Bui	4c47affcf1	core: remove the redundancy of 'n_fds' and 'n_storage_fds' in ExecParameters struct 'n_fds' field in the ExecParameters structure was counting the total number of file descriptors to be passed to a unit. This counter also includes the number of passed socket fds which is counted by 'n_socket_fds' already. This patch removes that redundancy by replacing 'n_fds' with 'n_storage_fds'. The new field only counts the fds passed via the storage store mechanism. That way each fd is counted at one place only. Subsequently the patch makes sure to fix code that used 'n_fds' and also wanted to iterate through all of them by explicitly adding 'n_socket_fds' + 'n_storage_fds'. Suggested by Lennart.	2017-06-08 16:21:35 +02:00
Franck Bui	9b1419111a	core: only apply NonBlocking= to fds passed via socket activation Make sure to only apply the O_NONBLOCK flag to the fds passed via socket activation. Previously the flag was also applied to the fds which came from the fd store but this was incorrect since services, after being restarted, expect that these passed fds have their flags unchanged and can be reused as before. The documentation was a bit unclear about this so clarify it.	2017-06-06 22:42:50 +02:00
Zbigniew Jędrzejewski-Szmek	52511fae7b	core: fix warning about unsigned variable (#5935) Fixup for `d8c92e8bc7`.	2017-05-11 08:15:28 +02:00
Lennart Poettering	4e168f4606	Merge pull request #5420 from OpenDZ/tixxdz/namespace-fixes-v2 Namespace: RootImage= RootDirectory= and MountAPIVFS fixes	2017-05-09 20:42:32 +02:00
Aggelos Avgerinos	488ab41cb8	execute: Properly log errors considering socket fds (#5910) Till now if the params->n_fds was 0, systemd was logging that there were more than one sockets. Thanks @gregoryp and @VFXcode who did the most work debugging this.	2017-05-08 19:09:22 -04:00
Zbigniew Jędrzejewski-Szmek	d8c92e8bc7	execute: filter out "." and ".." in EnvironmentFile= globs too This doesn't really matter much, only in case somebody would use something strange like EnvironmentFile=/etc/something/.* Make sure that "." and ".." are not returned by that glob. This makes all our globbing patterns behave the same.	2017-04-27 13:21:08 -04:00
Djalal Harouni	74e941c022	Merge pull request #5774 from keszybz/printf-annotations Printf annotation improvements	2017-04-23 01:03:42 +02:00
Zbigniew Jędrzejewski-Szmek	ba360bb05c	tree-wide: mark log_struct with _printf_ and fix fallout log_struct takes multiple format strings, each one followed by arguments. The _printf_ annotation is not sufficiently flexible to express this, but we can still annotate the first format string, though not its arguments (because their number is unknown). With the annotation, the places which specified the message id or similar as the first pattern cause a warning from -Wformat-nonliteral. This can be trivially fixed by putting the MESSAGE= first. This change will help find issues where a non-literal is erroneously used as the pattern.	2017-04-21 13:37:04 -04:00
Yu Watanabe	4d8b0f0f7a	core: downgrade error message if command is prefixed with `-` and the command is not found Fixes #5621	2017-04-03 15:38:37 +09:00
Djalal Harouni	9c988f934b	namespace: Apply MountAPIVFS= only when a Root directory is set The MountAPIVFS= documentation says that this option has no effect unless used in conjunction with RootDirectory= or RootImage=, so let's fix this and avoid creating private mount namespaces where they are not needed.	2017-03-05 21:39:43 +01:00
Zbigniew Jędrzejewski-Szmek	643f4706b0	core/execute: add (void) CID #778045.	2017-02-20 16:02:18 -05:00
Zbigniew Jędrzejewski-Szmek	2b0445262a	tree-wide: add SD_ID128_MAKE_STR, remove LOG_MESSAGE_ID Embedding sd_id128_t's in constant strings was rather cumbersome. We had SD_ID128_CONST_STR which returned a const char[], but it had two problems: - it wasn't possible to statically concatenate this array with a normal string - gcc wasn't really able to optimize this, and generated code to perform the "conversion" at runtime. Because of this, even our own code in coredumpctl wasn't using SD_ID128_CONST_STR. Add a new macro to generate a constant string: SD_ID128_MAKE_STR. It is not as elegant as SD_ID128_CONST_STR, because it requires a repetition of the numbers, but in practice it is more convenient to use, and allows gcc to generate smarter code: $ size .libs/systemd{,-logind,-journald}{.old,} text data bss dec hex filename 1265204 149564 4808 1419576 15a938 .libs/systemd.old 1260268 149564 4808 1414640 1595f0 .libs/systemd 246805 13852 209 260866 3fb02 .libs/systemd-logind.old 240973 13852 209 255034 3e43a .libs/systemd-logind 146839 4984 34 151857 25131 .libs/systemd-journald.old 146391 4984 34 151409 24f71 .libs/systemd-journald It is also much easier to check if a certain binary uses a certain MESSAGE_ID: $ strings .libs/systemd.old|grep MESSAGE_ID MESSAGE_ID=%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x MESSAGE_ID=%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x MESSAGE_ID=%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x MESSAGE_ID=%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x $ strings .libs/systemd|grep MESSAGE_ID MESSAGE_ID=c7a787079b354eaaa9e77b371893cd27 MESSAGE_ID=b07a249cd024414a82dd00cd181378ff MESSAGE_ID=641257651c1b4ec9a8624d7a40a9e1e7 MESSAGE_ID=de5b426a63be47a7b6ac3eaac82e2f6f MESSAGE_ID=d34d037fff1847e6ae669a370e694725 MESSAGE_ID=7d4958e842da4a758f6c1cdc7b36dcc5 MESSAGE_ID=1dee0369c7fc4736b7099b38ecb46ee7 MESSAGE_ID=39f53479d3a045ac8e11786248231fbf MESSAGE_ID=be02cf6855d2428ba40df7e9d022f03d MESSAGE_ID=7b05ebc668384222baa8881179cfda54 MESSAGE_ID=9d1aaa27d60140bd96365438aad20286	2017-02-15 00:45:12 -05:00
Lennart Poettering	6818c54ca6	core: skip ReadOnlyPaths= and other permission-related mounts on PermissionsStartOnly= (#5309) ReadOnlyPaths=, ProtectHome=, InaccessiblePaths= and ProtectSystem= are about restricting access and little more, hence they should be disabled if PermissionsStartOnly= is used or ExecStart= lines are prefixed with a "+". Do that. (Note that we will still create namespaces and stuff, since that's about a lot more than just permissions. We'll simply disable the effect of the four options mentioned above, but nothing else mount related.) This also adds a test for this, to ensure this works as intended. No documentation updates, as the documentation is already vague enough to support the new behaviour ("If true, the permission-related execution options…"). We could clarify this further, but I think we might want to extend the switches' behaviour a bit more in future, hence leave it at this for now. Fixes: #5308	2017-02-12 00:44:46 -05:00
Lennart Poettering	376fecf670	execute: set the right exit status for CHDIR vs. CHROOT Fixes: #5125	2017-02-09 13:18:35 +01:00
Lennart Poettering	3b0e5bb524	execute: use prefix_roota() where appropriate	2017-02-09 13:18:35 +01:00
Lennart Poettering	6732edab4e	execute: set working directory to /root if User= is not set, but WorkingDirectory=~ is Or actually, try to do the right thing depending on what is available: - If we know $HOME from User=, then use that. - If the UID for the service is 0, hardcode that WorkingDirectory=~ means WorkingDirectory=/root - In any other case (which will be the unprivileged --user case), use get_home_dir() to find the $HOME of the user we are running as. - Otherwise fail. Fixes: #5246 #5124	2017-02-09 13:17:58 +01:00
Lennart Poettering	23deef88b9	Revert "core/execute: set HOME, USER also for root users" This reverts commit `8b89628a10`. This broke #5246	2017-02-09 11:43:44 +01:00
Lennart Poettering	915e6d1676	core: add RootImage= setting for using a specific image file as root directory for a service This is similar to RootDirectory= but mounts the root file system from a block device or loopback file instead of another directory. This reuses the image dissector code now used by nspawn and gpt-auto-discovery.	2017-02-07 12:19:42 +01:00
Lennart Poettering	5d997827e2	core: add a per-unit setting MountAPIVFS= for mounting /dev, /proc, /sys in conjunction with RootDirectory= This adds a boolean unit file setting MountAPIVFS=. If set, the three main API VFS mounts will be mounted for the service. This only has an effect on RootDirectory=, which it makes a ton times more useful. (This is basically the /dev + /proc + /sys mounting code posted in the original #4727, but rebased on current git, and with the automatic logic replaced by explicit logic controlled by a unit file setting)	2017-02-07 11:22:05 +01:00
Zbigniew Jędrzejewski-Szmek	6a93917df9	core/execute: pass the username to utmp/wtmp database Before previous commit, username would be NULL for root, and set only for other users. So the argument passed to utmp_put_init_process() would be "root" for other users and NULL for root. Seems strange. Instead, always pass the username if available.	2017-02-03 11:49:43 -05:00
Zbigniew Jędrzejewski-Szmek	8b89628a10	core/execute: set HOME, USER also for root users This changes the environment for services running as root from: LANG=C.utf8 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin INVOCATION_ID=ffbdec203c69499a9b83199333e31555 JOURNAL_STREAM=8:1614518 to LANG=C.utf8 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin HOME=/root LOGNAME=root USER=root SHELL=/bin/sh INVOCATION_ID=15a077963d7b4ca0b82c91dc6519f87c JOURNAL_STREAM=8:1616718 Making the environment special for the root user complicates things unnecessarily. This change simplifies both our logic (by making the setting of the variables unconditional), and should also simplify the logic in services (particularly scripts). Fixes #5124.	2017-02-03 11:49:22 -05:00
Zbigniew Jędrzejewski-Szmek	587ab01b53	core/execute.c: check asprintf return value in the usual fashion This is unlikely to fail, but we cannot rely on asprintf return value on failure, so let's just be correct here. CID #1368227.	2017-01-31 11:31:47 -05:00
Zbigniew Jędrzejewski-Szmek	56fbd56143	core/execute: reformat exec_context_named_iofds() for legibility	2017-01-31 11:23:10 -05:00
Zbigniew Jędrzejewski-Szmek	06ec51d8ef	core/execute: fix strv memleak compile_read_write_paths() returns a normal strv from strv_copy(), and setup_namespace() uses it read-only, so we should use strv_free to deallocate.	2017-01-24 22:26:10 -05:00
Zbigniew Jędrzejewski-Szmek	5b3637b44a	Merge pull request #4991 from poettering/seccomp-fix	2017-01-17 23:10:46 -05:00
Zbigniew Jędrzejewski-Szmek	70dd455c8e	pid1: provide a more detailed error message when execution fails (#5074) Fixes #5000.	2017-01-17 22:38:55 -05:00
Lennart Poettering	469830d142	seccomp: rework seccomp code, to improve compat with some archs This substantially reworks the seccomp code, to ensure better compatibility with some architectures, including i386. So far we relied on libseccomp's internal handling of the multiple syscall ABIs supported on Linux. This is problematic however, as it does not define clear semantics if an ABI is not able to support specific seccomp rules we install. This rework hence changes a couple of things: - We no longer use seccomp_rule_add(), but only seccomp_rule_add_exact(), and fail the installation of a filter if the architecture doesn't support it. - We no longer rely on adding multiple syscall architectures to a single filter, but instead install a separate filter for each syscall architecture supported. This way, we can install a strict filter for x86-64, while permitting a less strict filter for i386. - All high-level filter additions are now moved from execute.c to seccomp-util.c, so that we can test them independently of the service execution logic. - Tests have been added for all types of our seccomp filters. - SystemCallFilters= and SystemCallArchitectures= are now implemented in independent filters and installation logic, as they semantically are very much independent of each other. Fixes: #4575	2017-01-17 22:14:27 -05:00
Zbigniew Jędrzejewski-Szmek	4014818d53	Merge pull request #4806 from poettering/keyring-init set up a per-service session kernel keyring, and store the invocation ID in it	2016-12-13 23:24:42 -05:00
Lennart Poettering	d2d6c096f6	core: add ability to define arbitrary bind mounts for services This adds two new settings BindPaths= and BindReadOnlyPaths=. They allow defining arbitrary bind mounts specific to particular services. This is particularly useful for services with RootDirectory= set as this permits making specific bits of the host directory available to chrooted services. The two new settings follow the concepts nspawn already possesses in --bind= and --bind-ro=, as well as the .nspawn settings Bind= and BindReadOnly= (and these latter options should probably be renamed to BindPaths= and BindReadOnlyPaths= too). Fixes: #3439	2016-12-14 00:54:10 +01:00
Lennart Poettering	b3415f5dae	core: store the invocation ID in the per-service keyring Let's store the invocation ID in the per-service keyring as a root-owned key, with strict access rights. This has the advantage over the environment-based ID passing that it also works from SUID binaries (as the key cannot be overridden by unprivileged code starting them), in contrast to the secure_getenv() based mode. The invocation ID is now passed in three different ways to a service: - As environment variable $INVOCATION_ID. This is easy to use, but may be overridden by unprivileged code (which might be a bad or a good thing), which means it's incompatible with SUID code (see above). - As extended attribute on the service cgroup. This cannot be overridden by unprivileged code, and may be queried safely from "outside" of a service. However, it is incompatible with containers right now, as unprivileged containers generally cannot set xattrs on cgroupfs. - As "invocation_id" key in the kernel keyring. This has the benefit that the key cannot be changed by unprivileged service code, and thus is safe to access from SUID code (see above). But do note that service code can replace the session keyring with a fresh one that lacks the key. However in that case the key will not be owned by root, which is easily detectable. The keyring is also incompatible with containers right now, as it is not properly namespace aware (but this is being worked on), and thus most container managers mask the keyring-related system calls. Ideally we'd only have one way to pass the invocation ID, but the different ways all have limitations. The invocation ID hookup in journald is currently only available on the host but not in containers, due to the mentioned limitations. How to verify the new invocation ID in the keyring: # systemd-run -t /bin/sh Running as unit: run-rd917366c04f847b480d486017f7239d6.service Press ^] three times within 1s to disconnect TTY. # keyctl show Session Keyring 680208392 --alswrv 0 0 keyring: _ses 250926536 ----s-rv 0 0 \_ user: invocation_id # keyctl request user invocation_id 250926536 # keyctl read 250926536 16 bytes of data in key: 9c96317c ac64495a a42b9cd7 4f3ff96b # echo $INVOCATION_ID 9c96317cac64495aa42b9cd74f3ff96b # ^D This creates a new transient service running a shell. Then verifies the contents of the keyring, requests the invocation ID key, and reads its payload. For comparison the invocation ID as passed via the environment variable is also displayed.	2016-12-13 20:59:36 +01:00
Lennart Poettering	74dd6b515f	core: run each system service with a fresh session keyring This patch ensures that each system service gets its own session kernel keyring automatically, and implicitly. Without this a keyring is allocated for it on-demand, but is then linked with the user's kernel keyring, which is OK behaviour for logged in users, but not so much for system services. With this change each service gets a session keyring that is specific to the service and ceases to exist when the service is shut down. The session keyring is not linked up with the user keyring and keys hence only search within the session boundaries by default. (This is useful in a later commit to store per-service material in the keyring, for example the invocation ID) (With input from David Howells)	2016-12-13 20:59:10 +01:00
Lennart Poettering	2e6dbc0fcd	Merge pull request #4538 from fbuihuu/confirm-spawn-fixes Confirm spawn fixes/enhancements	2016-11-18 11:08:06 +01:00
Franck Bui	539622bd8c	core: in confirm spawn, suggest 'f' when user selects 'n' choice	2016-11-17 18:23:32 +01:00
Franck Bui	c891efaf8a	core: confirm_spawn: always accept units with same_pgrp set for now For some reason units remaining in the same process group as PID 1 (same_pgrp=true) fail to acquire the console even if it's not taken by anyone. So always accept for units with same_pgrp set for now.	2016-11-17 18:16:51 +01:00
Franck Bui	63d77c9254	core: include the unit name when notifying that a confirmation question timed out	2016-11-17 18:16:51 +01:00
Franck Bui	b0eb29449e	core: add 'c' in confirmation_spawn to resume the boot process	2016-11-17 18:16:50 +01:00
Franck Bui	56fde33af1	core: add 'j' in confirmation_spawn to list the jobs that are in progress	2016-11-17 18:16:50 +01:00
Franck Bui	dd6f9ac0d0	core: add 'D' in confirm spawn to show a full dump of the unit to spawn	2016-11-17 18:16:50 +01:00
Franck Bui	eedf223a30	core: add 'i' in confirm spawn to give a short summary of the unit to spawn	2016-11-17 18:16:50 +01:00
Franck Bui	d172b175f6	core: rework the confirmation spawn prompt Previously it was "[Yes, Fail, Skip]" which is pretty misleading because it suggests that the whole word needs to be entered instead of a single char. Also this won't fit well when we extend the number of choices. This patch addresses this by changing the choice hint to "[y, f, s – h for help]" so it's now clear that a single letter has to be entered. It also introduces a new choice 'h' which describes all possible choices, since a single letter may not be descriptive enough for new users. It also allows sticking with the same hint string regardless of how many choices we will support.	2016-11-17 18:16:50 +01:00
Franck Bui	2bcd3c26fe	core: limit the length of the confirmation question When "confirmation_spawn=1", the confirmation question can look like: Execute /usr/bin/kmod static-nodes --format=tmpfiles --output=/run/tmpfiles.d/kmod.conf? [Yes, No, Skip] which is pretty verbose and might not fit in the console width (which is usually 80 chars) and thus the question will be split into 2 consecutive lines. However since the question is now refreshed every 2 secs, the reprinted question will overwrite the second line of the previous one... To prevent this, this patch makes sure that the command line won't be longer than 60 chars by ellipsizing it if the command is longer: Execute /usr/bin/kmod static-nodes --format=tmpfiles --output=/ru…nf? [Yes, No, View, Skip] A following patch will introduce a new choice that will allow the user to get details on the command to be executed so it will still be possible to see the full command line.	2016-11-17 18:16:50 +01:00
Franck Bui	2bcc330942	core: in confirm_spawn, the meaning of 'n' and 's' choices are confusing Before this patch we had: - "no" which gives "failing execution" but the command is actually assumed to have succeeded. - "skip" which gives "skipping", but the command is assumed to have failed, which ends up with "Failed to start ..." on the console. Now we have: - "fail" which gives "failing execution" and the command is indeed assumed to have failed. - "skip" which gives "skipping execution" and the command is assumed to have succeeded.	2016-11-17 18:16:49 +01:00
Franck Bui	3b20f877ad	core: rework ask_for_confirmation() Now the responses are handled by ask_for_confirmation() as well as the report of any errors occurring during the process of retrieving the confirmation response. One benefit of this is that there's no need to open/close the console one more time when reporting error/status messages. The caller now just needs to care about the return values whose meanings are: - don't execute and pretend that the command failed - don't execute and pretend that the command succeeded - positive answer, execute the command Also some slight code reorganization and introduce write_confirm_error() and write_confirm_error_fd(). write_confirm_message becomes unneeded.	2016-11-17 18:16:49 +01:00
Franck Bui	7d5ceb6416	core: allow to redirect confirmation messages to a different console It's rather hard to parse the confirmation messages (enabled with systemd.confirm_spawn=true) amongst the status messages and the kernel ones (if enabled). This patch gives the possibility to the user to redirect the confirmation message to a different virtual console, either by giving its name or its path, so those messages are separated from the other ones and easier to read.	2016-11-17 18:16:16 +01:00
Djalal Harouni	c92e8afebd	core: improve the logic that implies no new privileges The no_new_privileged_set variable is not used any more since commit `9b232d3241` that fixed another thing. So remove it. Also no need to check if we are under user manager, remove that part too.	2016-11-15 15:04:31 +01:00
Djalal Harouni	af964954c6	core: on DynamicUser= make sure that protecting sensitive paths is enforced (#4596) This adds a variable that is always set to false to make sure that protect paths inside sandbox are always enforced and not ignored. The only case when it is set to true is on DynamicUser=no and RootDirectory=/chroot is set. This allows users to use more of our sandbox features inside RootDirectory= The only exception is ProtectSystem=full|strict and when DynamicUser=yes is implied. Currently RootDirectory= is not fully compatible with these due to two reasons: * /chroot/usr|etc has to be present on ProtectSystem=full * /chroot// has to be a mount point on ProtectSystem=strict.	2016-11-08 21:57:32 -05:00
Zbigniew Jędrzejewski-Szmek	d85a0f8028	Merge pull request #4536 from poettering/seccomp-namespaces core: add new RestrictNamespaces= unit file setting Merging, not rebasing, because this touches many files and there were tree-wide cleanups in the mean time.	2016-11-08 19:54:21 -05:00
Zbigniew Jędrzejewski-Szmek	f97b34a629	Rename formats-util.h to format-util.h We don't have plural in the name of any other -util files and this inconsistency trips me up every time I try to type this file name from memory. "formats-util" is even hard to pronounce.	2016-11-07 10:15:08 -05:00
Lennart Poettering	add005357d	core: add new RestrictNamespaces= unit file setting This new setting permits restricting whether namespaces may be created and managed by processes started by a unit. It installs a seccomp filter blocking certain invocations of unshare(), clone() and setns(). RestrictNamespaces=no is the default, and does not restrict namespaces in any way. RestrictNamespaces=yes takes away the ability to create or manage any kind of namespace. "RestrictNamespaces=mnt ipc" restricts the creation of namespaces so that only mount and IPC namespaces may be created/managed, but no other kind of namespaces. This setting should improve security quite a bit as in particular user namespacing was a major source of CVEs in the kernel in the past, and is accessible to unprivileged processes. With this setting the entire attack surface may be removed for system services that do not make use of namespaces.	2016-11-04 07:40:13 -06:00
Lennart Poettering	493fd52f1a	Merge pull request #4510 from keszybz/tree-wide-cleanups Tree wide cleanups	2016-11-03 13:59:20 -06:00
Djalal Harouni	cdc5d5c55e	core: initialize user aux groups and SupplementaryGroups= when DynamicUser= is set Make sure that when DynamicUser= is set that we initialize the user supplementary groups and that we also support SupplementaryGroups= Fixes: https://github.com/systemd/systemd/issues/4539 Thanks Evgeny Vereshchagin (@evverx)	2016-11-03 08:36:53 +01:00
Lennart Poettering	32e134c19f	Merge pull request #4483 from poettering/exec-order more seccomp fixes, and change of order of selinux/aa/smack and seccomp application on exec	2016-11-02 16:09:59 -06:00
Djalal Harouni	bbeea27117	core: initialize groups list before checking SupplementaryGroups= of a unit (#4533) Always initialize the supplementary groups of caller before checking the unit SupplementaryGroups= option. Fixes https://github.com/systemd/systemd/issues/4531	2016-11-02 10:51:35 -06:00
Lennart Poettering	5cd9cd3537	execute: apply seccomp filters after changing selinux/aa/smack contexts Seccomp is generally an unprivileged operation, changing security contexts is most likely associated with some form of policy. Moreover, while seccomp may influence our own flow of code quite a bit (much more than the security context change) make sure to apply the seccomp filters immediately before executing the binary to invoke. This also moves enforcement of NNP after the security context change, so that NNP cannot affect it anymore. (However, the security policy now has to permit the NNP change). This change has a good chance of breaking current SELinux/AA/SMACK setups, because the policy might not expect this change of behaviour. However, it's technically the better choice I think and should hence be applied. Fixes: #3993	2016-11-02 08:55:00 -06:00
Djalal Harouni	fa1f250d6f	Merge pull request #4495 from topimiettinen/block-shmat-exec seccomp: also block shmat(..., SHM_EXEC) for MemoryDenyWriteExecute	2016-10-28 15:41:07 +02:00
Djalal Harouni	59e856c7d3	core: make unit argument const for apply seccomp functions	2016-10-27 09:40:22 +02:00
Djalal Harouni	50b3dfb9d6	core: let's apply working directory just after mount namespaces This makes groups get applied after the working directory. This may allow some flexibility, but at the same time it is not a big deal since we don't execute or do anything between applying the working directory and dropping groups.	2016-10-27 09:40:21 +02:00
Djalal Harouni	2b3c1b9e9d	core: get the working directory value inside apply_working_directory() Improve apply_working_directory() and let's get the current working directory inside of it.	2016-10-27 09:40:21 +02:00
Djalal Harouni	e7f1e7c6e2	core: move apply working directory code into its own apply_working_directory()	2016-10-27 09:40:21 +02:00
Djalal Harouni	93c6bb51b6	core: move the code that sets up namespaces into its own function	2016-10-27 09:40:21 +02:00
Topi Miettinen	d2ffa389b8	seccomp: also block shmat(..., SHM_EXEC) for MemoryDenyWriteExecute shmat(..., SHM_EXEC) can be used to create writable and executable memory, so let's block it when MemoryDenyWriteExecute is set.	2016-10-26 18:59:14 +03:00
Lennart Poettering	a3be2849b2	seccomp: add new helper call seccomp_load_filter_set() This allows us to unify most of the code in apply_protect_kernel_modules() and apply_private_devices().	2016-10-24 17:32:50 +02:00
Lennart Poettering	8d7b0c8fd7	seccomp: add new seccomp_init_conservative() helper This adds a new seccomp_init_conservative() helper call that is mostly just a wrapper around seccomp_init(), but turns off NNP and adds in all secondary archs, for best compatibility with everything else. Pretty much all of our code used the very same constructs for these three steps, hence unifying this in one small function makes things a lot shorter. This also changes incorrect usage of the "scmp_filter_ctx" type at various places. libseccomp defines it as typedef to "void", i.e. it is a pointer type (pretty poor choice already!) that casts implicitly to and from all other pointer types (even poorer choice: you defined a confusing type now, and don't even gain any bit of type safety through it...). A lot of the code assumed the type would refer to a structure, and hence added additional "" here and there. Remove that.	2016-10-24 17:32:50 +02:00
Lennart Poettering	25a8d8a0cb	core: rework apply_protect_kernel_modules() to use seccomp_add_syscall_filter_set() Let's simplify this call, by making use of the new infrastructure. This is actually more in line with Djalal's original patch but instead of search the filter set in the array by its name we can now use the set index and jump directly to it.	2016-10-24 17:32:50 +02:00
Lennart Poettering	8130926d32	core: rework syscall filter set handling A variety of fixes: - rename the SystemCallFilterSet structure to SyscallFilterSet. So far the main instance of it (the syscall_filter_sets[] array) used to abbreviate "SystemCall" as "Syscall". Let's stick to one of the two syntaxes, and not mix and match too wildly. Let's pick the shorter name in this case, as it is sufficiently well established to not confuse hackers reading this. - Export explicit indexes into the syscall_filter_sets[] array via an enum. This way, code that wants to make use of a specific filter set, can index it directly via the enum, instead of having to search for it. This makes apply_private_devices() in particular a lot simpler. - Provide two new helper calls in seccomp-util.c: syscall_filter_set_find() to find a set by its name, seccomp_add_syscall_filter_set() to add a set to a seccomp object. - Update SystemCallFilter= parser to use extract_first_word(). Let's work on deprecating FOREACH_WORD_QUOTED(). - Simplify apply_private_devices() using this functionality	2016-10-24 17:32:50 +02:00
Lennart Poettering	e0f3720e39	core: move misplaced comment to the right place	2016-10-24 17:29:51 +02:00
Lennart Poettering	f673b62df6	core: simplify skip_seccomp_unavailable() a bit Let's prefer early-exit over deep-indented if blocks. No behavioural change.	2016-10-24 17:29:51 +02:00
Djalal Harouni	366ddd252e	core: do not assert when sysconf(_SC_NGROUPS_MAX) fails (#4466) Remove the assert and check the return code of sysconf(_SC_NGROUPS_MAX). _SC_NGROUPS_MAX maps to NGROUPS_MAX which is defined in <limits.h> to 65536 these days. The value is a sysctl read-only /proc/sys/kernel/ngroups_max and the kernel assumes that it is always positive otherwise things may break. Follow this and support only positive values, for all other cases return either -errno or -EOPNOTSUPP. Now if there are systems that want to re-write NGROUPS_MAX then they should not pass SupplementaryGroups= in units even if it is empty, in this case nothing fails and we just ignore supplementary groups. However if SupplementaryGroups= is passed even if it is empty we have to assume that there will be groups manipulation from our side or the kernel and since the kernel always assumes that NGROUPS_MAX is positive, then follow that and support only positive values.	2016-10-24 13:13:06 +02:00
Djalal Harouni	8b6903ad4d	core: let's move the setup of working directory before group enforce This is minor but let's try to split and move bit by bit cgroups and portable environment setup before applying the security context.	2016-10-23 23:27:20 +02:00
Djalal Harouni	4d885bd326	core: first lookup and cache creds then apply them after namespace setup This fixes: https://github.com/systemd/systemd/issues/4357 Let's lookup and cache creds then apply them. We also switch from getgroups() to getgrouplist().	2016-10-23 23:24:14 +02:00
Zbigniew Jędrzejewski-Szmek	605405c6cc	tree-wide: drop NULL sentinel from strjoin This makes strjoin and strjoina more similar and avoids the useless final argument. spatch -I . -I ./src -I ./src/basic -I ./src/basic -I ./src/shared -I ./src/shared -I ./src/network -I ./src/locale -I ./src/login -I ./src/journal -I ./src/journal -I ./src/timedate -I ./src/timesync -I ./src/nspawn -I ./src/resolve -I ./src/resolve -I ./src/systemd -I ./src/core -I ./src/core -I ./src/libudev -I ./src/udev -I ./src/udev/net -I ./src/udev -I ./src/libsystemd/sd-bus -I ./src/libsystemd/sd-event -I ./src/libsystemd/sd-login -I ./src/libsystemd/sd-netlink -I ./src/libsystemd/sd-network -I ./src/libsystemd/sd-hwdb -I ./src/libsystemd/sd-device -I ./src/libsystemd/sd-id128 -I ./src/libsystemd-network --sp-file coccinelle/strjoin.cocci --in-place $(git ls-files src/*.c) git grep -e '\bstrjoin\b.*NULL' -l|xargs sed -i -r 's/strjoin\((.*), NULL\)/strjoin(\1)/' This might have missed a few cases (spatch has a really hard time dealing with _cleanup_ macros), but that's no big issue, they can always be fixed later.	2016-10-23 11:43:27 -04:00
Luca Bruno	52c239d770	core/exec: add a named-descriptor option ("fd") for streams (#4179 ) This commit adds a `fd` option to `StandardInput=`, `StandardOutput=` and `StandardError=` properties in order to connect standard streams to externally named descriptors provided by some socket units. This option looks for a file descriptor named as the corresponding stream. Custom names can be specified, separated by a colon. If multiple name-matches exist, the first matching fd will be used.	2016-10-17 20:05:49 -04:00
Zbigniew Jędrzejewski-Szmek	6b430fdb7c	tree-wide: use mfree more	2016-10-16 23:35:39 -04:00
Djalal Harouni	e66a2f658b	core: make sure to dump ProtectKernelModules= value	2016-10-12 14:12:17 +02:00
Djalal Harouni	4084e8fc89	core: check protect_kernel_modules and private_devices in order to setup NNP	2016-10-12 14:12:07 +02:00
Djalal Harouni	c575770b75	core:sandbox: let's make /lib/modules/ inaccessible on ProtectKernelModules= Let's go further and make /lib/modules/ inaccessible for services that do not have business with modules. This is a minor improvement, but it may help on setups with custom modules where they are limited... in regard to the kernel auto-load feature. This change introduces the NameSpaceInfo struct, which we may embed later inside ExecContext, but for now let's just reduce the number of arguments to setup_namespace() and merge the ProtectKernelModules feature.	2016-10-12 14:11:16 +02:00
Djalal Harouni	502d704e5e	core:sandbox: Add ProtectKernelModules= option This is useful to turn off explicit module load and unload operations on modular kernels. This option removes CAP_SYS_MODULE from the capability bounding set for the unit, and installs a system call filter to block module system calls. This option will not prevent the kernel from loading modules using the module auto-load feature which is a system wide operation.	2016-10-12 13:31:21 +02:00
Lennart Poettering	e0d2adfde6	core: chown() any TTY used for stdin, not just when StandardInput=tty is used (#4347 ) If stdin is supplied as an fd for transient units (using the StandardInputFileDescriptor pseudo-property for transient units), then we should also fix up the TTY ownership, not just when we opened the TTY ourselves. This simply drops the explicit is_terminal_input()-based check. Note that chown_terminal() internally does a much more appropriate isatty()-based check anyway, hence we can drop this without replacement. Fixes: #4260	2016-10-11 14:07:22 -04:00
Lennart Poettering	4b58153dd2	core: add "invocation ID" concept to service manager This adds a new invocation ID concept to the service manager. The invocation ID identifies each runtime cycle of a unit uniquely. A new randomized 128bit ID is generated each time a unit moves from an inactive to an activating or active state. The primary usecase for this concept is to connect the runtime data PID 1 maintains about a service with the offline data the journal stores about it. Previously we'd use the unit name plus start/stop times, which however is highly racy since the journal will generally process log data after the service already ended. The "invocation ID" kinda matches the "boot ID" concept of the Linux kernel, except that it applies to an individual unit instead of the whole system. The invocation ID is passed to the activated processes as environment variable. It is additionally stored as extended attribute on the cgroup of the unit. The latter is used by journald to automatically retrieve it for each logged message and attach it to the log entry. The environment variable is very easily accessible, even for unprivileged services. OTOH the extended attribute is only accessible to privileged processes (this is because cgroupfs only supports the "trusted." xattr namespace, not "user."). The environment variable may be altered by services, the extended attribute may not be, hence is the better choice for the journal. Note that reading the invocation ID off the extended attribute from journald is racy, similar to the way reading the unit name for a logging process is. This patch adds APIs to read the invocation ID to sd-id128: sd_id128_get_invocation() may be used in a similar fashion to sd_id128_get_boot(). PID1's own logging is updated to always include the invocation ID when it logs information about a unit. A new bus call GetUnitByInvocationID() is added that allows retrieving a bus path to a unit by its invocation ID. The bus path is built using the invocation ID, thus providing a path for referring to a unit that is valid only for the current runtime cycle of it. Outlook for the future: should the kernel eventually allow passing of cgroup information along AF_UNIX/SOCK_DGRAM messages via a unique cgroup id, then we can alter the invocation ID to be generated as hash from that rather than entirely randomly. This way we can derive the invocation race-freely from the messages.	2016-10-07 20:14:38 +02:00
Lennart Poettering	97f0e76f18	user-util: rework maybe_setgroups() a bit Let's drop the caching of the setgroups /proc field for now. While there's a strict regime in place when it changes states, let's better not cache it since we cannot really be sure we follow that regime correctly. More importantly however, this is not in performance sensitive code, and there's no indication the cache is really beneficial, hence let's drop the caching and make things a bit simpler. Also, while we are at it, rework the error handling a bit, and always return negative errno-style error codes, following our usual coding style. This has the benefit that we can sensibly handle read_one_line_file() errors, without having to update errno explicitly.	2016-10-06 19:04:10 +02:00
Lennart Poettering	2d6fce8d7c	core: leave PAM stub process around with GIDs updated In the process execution code of PID 1, before `096424d123` the GID settings were changed before invoking PAM, and the UID settings after. After the change both changes are made after the PAM session hooks are run. When invoking PAM we fork once, and leave a stub process around which will invoke the PAM session end hooks when the session goes away. This code previously was dropping the remaining privs (which were precisely the UID). Fix this code to do this correctly again, by really dropping them all (i.e. the GID as well). While we are at it, also fix error logging of this code. Fixes: #4238	2016-10-06 19:04:10 +02:00
Giuseppe Scrivano	36d854780c	core: do not fail in a container if we can't use setgroups It might be blocked through /proc/PID/setgroups	2016-10-06 11:49:00 +02:00
Stefan Schweter	629ff674ac	tree-wide: remove consecutive duplicate words in comments	2016-10-04 17:06:25 +02:00
Djalal Harouni	8f81a5f61b	core: Use @raw-io syscall group to filter I/O syscalls when PrivateDevices= is set Instead of having a local syscall list, use the @raw-io group which contains the same set of syscalls to filter.	2016-09-25 12:52:27 +02:00
Lennart Poettering	cefc33aee2	execute: move SMACK setup code into its own function While we are at it, move PAM code #ifdeffery into setup_pam() to simplify the main execution logic a bit.	2016-09-25 10:52:57 +02:00
Lennart Poettering	ba128bb809	execute: filter low-level I/O syscalls if PrivateDevices= is set If device access is restricted via PrivateDevices=, let's also block the various low-level I/O syscalls at the same time, so that we know that the minimal set of devices in our virtualized /dev are really everything the unit can access.	2016-09-25 10:52:57 +02:00
Lennart Poettering	096424d123	execute: drop group privileges only after setting up namespace If PrivateDevices=yes is set, the namespace code creates device nodes in /dev that should be owned by the host's root, hence let's make sure we set up the namespace before dropping group privileges.	2016-09-25 10:42:18 +02:00
Lennart Poettering	3fbe8dbe41	execute: if RuntimeDirectory= is set, it should be writable Implicitly make all dirs set with RuntimeDirectory= writable, as the concept otherwise makes no sense.	2016-09-25 10:19:05 +02:00
Lennart Poettering	be39ccf3a0	execute: move suppression of HOME=/ and SHELL=/bin/nologin into user-util.c This adds a new call get_user_creds_clean(), which is just like get_user_creds() but returns NULL in the home/shell parameters if they contain no useful information. This code previously lived in execute.c, but by generalizing this we can reuse it in run.c.	2016-09-25 10:18:57 +02:00
Lennart Poettering	07689d5d2c	execute: split out creation of runtime dirs into its own functions	2016-09-25 10:18:54 +02:00
Lennart Poettering	59eeb84ba6	core: add two new service settings ProtectKernelTunables= and ProtectControlGroups= If enabled, these will block write access to /sys, /proc/sys and /sys/fs/cgroup.	2016-09-25 10:18:48 +02:00
Lennart Poettering	72246c2a65	core: enforce seccomp for secondary archs too, for all rules Let's make sure that all our rules apply to all archs the local kernel supports.	2016-09-25 10:18:44 +02:00
Felipe Sateler	d347d9029c	seccomp: also detect if seccomp filtering is enabled In https://github.com/systemd/systemd/pull/4004 , a runtime detection method for seccomp was added. However, it does not detect the case where CONFIG_SECCOMP=y but CONFIG_SECCOMP_FILTER=n. This is possible if the architecture does not support filtering yet. Add a check for that case too. While at it, change get_proc_field usage to use PR_GET_SECCOMP prctl, as that should save a few system calls and (unnecessary) allocations. Previously, reading of /proc/self/stat was done as recommended by prctl(2) as safer. However, given that we need to do the prctl call anyway, let's skip opening, reading and parsing the file. Code for checking inspired by https://outflux.net/teach-seccomp/autodetect.html	2016-09-06 20:25:49 -03:00
Felipe Sateler	83f12b27d1	core: do not fail at step SECCOMP if there is no kernel support (#4004 ) Fixes #3882	2016-08-22 22:40:58 +03:00

407 commits