Systemd

Author	SHA1	Message	Date
Lennart Poettering	e08f94acf5	loop-util: accept loopback flags when creating loopback device This way callers can choose if they want partition scanning or not.	2019-12-02 10:05:09 +01:00
Kevin Kuehler	94a7b2759d	core: ProtectKernelLogs= mask kmsg in proc and sys Block access to /dev/kmsg and /proc/kmsg when ProtectKernelLogs is set.	2019-11-14 12:58:43 -08:00
Yu Watanabe	e30e8b5073	tree-wide: drop stat.h or statfs.h when stat-util.h is included	2019-11-04 00:30:32 +09:00
Yu Watanabe	455fa9610c	tree-wide: drop string.h when string-util.h or friends are included	2019-11-04 00:30:32 +09:00
Yu Watanabe	f5947a5e92	tree-wide: drop missing.h	2019-10-31 17:57:03 +09:00
Zbigniew Jędrzejewski-Szmek	a5648b8094	basic/fs-util: change CHASE_OPEN flag into a separate output parameter chase_symlinks() would return negative on error, and either a non-negative status or a non-negative fd when CHASE_OPEN was given. This made the interface quite complicated, because depending on the flags used, we would get two different "types" of return object. Coverity was always confused by this, and flagged every use of chase_symlinks() without CHASE_OPEN as a resource leak (because it would think that an fd is returned). This patch uses a separate output parameter, so there is no confusion. (I think it is OK to have functions which return either an error or an fd. It's only returning either an fd or a non-fd that is confusing.)	2019-10-24 22:44:24 +09:00
Lennart Poettering	2caa38e99f	tree-wide: some more [static] related fixes let's add [static] where it was missing so far Drop [static] on parameters that can be NULL. Add an assert() around parameters that have [static] and can't be NULL hence. Add some "const" where it was forgotten.	2019-07-12 16:40:10 +02:00
Lennart Poettering	c6134d3e2f	path-util: get rid of prefix_root() prefix_root() is equivalent to path_join() in almost all ways, hence let's remove it. There are subtle differences though: prefix_root() will try to shorten multiple "/" before and after the prefix. path_join() doesn't do that. This means prefix_root() might return a string shorter than both its inputs combined, while path_join() never does that. I like the path_join() semantics better, hence I think dropping prefix_root() is totally OK. In the end the strings generated by both functions should always be identical in terms of path_equal() if not streq(). This leaves prefix_roota() in place. Ideally we'd have path_joina(), but I don't think we can reasonably implement that as a macro. Or maybe we can? (If so, sounds like something for a later PR.) Also add in a few missing OOM checks.	2019-06-21 08:42:55 +09:00
Zbigniew Jędrzejewski-Szmek	7cc5ef5f18	pid1: improve message when setting up namespace fails I covered the most obvious paths: those where there's a clear problem with a path specified by the user. Prints something like this (at error level): May 21 20:00:01.040418 systemd[125871]: bad-workdir.service: Failed to set up mount namespacing: /run/systemd/unit-root/etc/tomcat9/Catalina: No such file or directory May 21 20:00:01.040456 systemd[125871]: bad-workdir.service: Failed at step NAMESPACE spawning /bin/true: No such file or directory Fixes #10972.	2019-05-22 16:28:02 +02:00
Lennart Poettering	6990fb6bc6	tree-wide: (void)ify a few unlink() and rmdir() Let's be helpful to static analyzers which care about whether we knowingly ignore return values. We do in these cases, since they are usually part of error paths.	2019-03-27 18:09:56 +01:00
Lennart Poettering	9ce4e4b0f6	namespace: when DynamicUser=1 is set, mount StateDirectory= bind mounts "nosuid" Add even more suid/sgid protection to DynamicUser= environments: the state directories we bind mount from the host will now have the nosuid flag set, to disable the effect of suid/sgid bits on them.	2019-03-25 19:57:15 +01:00
Lennart Poettering	64e82c1976	mount-util: beef up bind_remount_recursive() to be able to toggle more than MS_RDONLY The function is otherwise generic enough to toggle other bind mount flags beyond MS_RDONLY (for example: MS_NOSUID or MS_NODEV), hence let's beef it up slightly to support that too.	2019-03-25 19:33:55 +01:00
Lennart Poettering	867189b545	namespace: get rid of {} around single-line if blocks	2019-03-25 19:33:55 +01:00
Lennart Poettering	39e91a2777	namespace: get rid of local variable	2019-03-25 19:33:55 +01:00
Lennart Poettering	1019a48f40	namespace: (void)ify a number of syscalls	2019-03-25 19:33:55 +01:00
Lennart Poettering	5f7a690aaa	namespace: replace one case of stack allocation with heap allocation The list of mounts might grow quite large, let's avoid the stack for this. Better safe than sorry.	2019-03-25 19:33:55 +01:00
Lennart Poettering	d8b4d14df4	util: split out nulstr related stuff to nulstr-util.[ch]	2019-03-14 13:25:52 +01:00
Lennart Poettering	760877e90c	util: split out sorting related calls to new sort-util.[ch]	2019-03-13 12:16:43 +01:00
Lennart Poettering	0cb8e3d118	util: split out namespace related stuff into a new namespace-util.[ch] pair Just some minor reorganization.	2019-03-13 12:16:38 +01:00
Yu Watanabe	5beb8688e0	core/namespace: logs mount mode when the entry is dropped	2019-03-13 11:53:22 +09:00
Yu Watanabe	1e05071d27	core/namespace: introduce new mount mode READWRITE_IMPLICIT ProtectSystem=strict or ProtectKernelTunables=yes create implicit read-write mounts, but they are not overridable by TemporaryFileSystem=. This makes such implicit read-write mounts use the new mount mode. So, they can be overridden by TemporaryFileSystem= now. A typical use case is combining ProtectSystem=strict with ProtectHome=tmpfs. Fixes #11276.	2019-03-13 11:51:09 +09:00
Lennart Poettering	51af7fb230	core: add open_netns_path() helper The new call allows us to open a netns from the file system, and store it in a "storage fd pair". It's supposed to work with setup_netns() and allows pre-population of the netns used with one opened from the file system.	2019-03-07 16:55:23 +01:00
Lennart Poettering	44ffcbaea4	execute: (void)ify more	2019-03-07 16:53:45 +01:00
Topi Miettinen	aecd5ac621	core: ProtectHostname= feature Let services use a private UTS namespace. In addition, a seccomp filter is installed on set{host,domain}name and read-only bind mounts are placed on /proc/sys/kernel/{host,domain}name.	2019-02-20 10:50:44 +02:00
Zbigniew Jędrzejewski-Szmek	3042bbebdd	tree-wide: use c99 static for array size declarations https://hamberg.no/erlend/posts/2013-02-18-static-array-indices.html This only works with clang, unfortunately gcc doesn't seem to implement the check (tested with gcc-8.2.1-5.fc29.x86_64). Simulated error: [2/3] Compiling C object 'systemd-nspawn@exe/src_nspawn_nspawn.c.o'. ../src/nspawn/nspawn.c:3179:45: warning: array argument is too small; contains 15 elements, callee requires at least 16 [-Warray-bounds] candidate = (uid_t) siphash24(arg_machine, strlen(arg_machine), hash_key); ^ ~~~~~~~~ ../src/basic/siphash24.h:24:64: note: callee declares array parameter as static here uint64_t siphash24(const void *in, size_t inlen, const uint8_t k[static 16]); ^~~~~~~~~~~~	2019-01-04 12:37:25 +01:00
Zbigniew Jędrzejewski-Szmek	049af8ad0c	Split out part of mount-util.c into mountpoint-util.c The idea is that anything which is related to actually manipulating mounts is in mount-util.c, but functions for mountpoint introspection are moved to the new file. Anything which requires libmount must be in mount-util.c. This was supposed to be a preparation for further changes, with no functional difference, but it results in a significant change in linkage: $ ldd build/libnss_*.so.2 (before) build/libnss_myhostname.so.2: linux-vdso.so.1 (0x00007fff77bf5000) librt.so.1 => /lib64/librt.so.1 (0x00007f4bbb7b2000) libmount.so.1 => /lib64/libmount.so.1 (0x00007f4bbb755000) libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f4bbb734000) libc.so.6 => /lib64/libc.so.6 (0x00007f4bbb56e000) /lib64/ld-linux-x86-64.so.2 (0x00007f4bbb8c1000) libblkid.so.1 => /lib64/libblkid.so.1 (0x00007f4bbb51b000) libuuid.so.1 => /lib64/libuuid.so.1 (0x00007f4bbb512000) libselinux.so.1 => /lib64/libselinux.so.1 (0x00007f4bbb4e3000) libpcre2-8.so.0 => /lib64/libpcre2-8.so.0 (0x00007f4bbb45e000) libdl.so.2 => /lib64/libdl.so.2 (0x00007f4bbb458000) build/libnss_mymachines.so.2: linux-vdso.so.1 (0x00007ffc19cc0000) librt.so.1 => /lib64/librt.so.1 (0x00007fdecb74b000) libcap.so.2 => /lib64/libcap.so.2 (0x00007fdecb744000) libmount.so.1 => /lib64/libmount.so.1 (0x00007fdecb6e7000) libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fdecb6c6000) libc.so.6 => /lib64/libc.so.6 (0x00007fdecb500000) /lib64/ld-linux-x86-64.so.2 (0x00007fdecb8a9000) libblkid.so.1 => /lib64/libblkid.so.1 (0x00007fdecb4ad000) libuuid.so.1 => /lib64/libuuid.so.1 (0x00007fdecb4a2000) libselinux.so.1 => /lib64/libselinux.so.1 (0x00007fdecb475000) libpcre2-8.so.0 => /lib64/libpcre2-8.so.0 (0x00007fdecb3f0000) libdl.so.2 => /lib64/libdl.so.2 (0x00007fdecb3ea000) build/libnss_resolve.so.2: linux-vdso.so.1 (0x00007ffe8ef8e000) librt.so.1 => /lib64/librt.so.1 (0x00007fcf314bd000) libcap.so.2 => /lib64/libcap.so.2 (0x00007fcf314b6000) libmount.so.1 => /lib64/libmount.so.1 (0x00007fcf31459000) libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fcf31438000) libc.so.6 => /lib64/libc.so.6 (0x00007fcf31272000) /lib64/ld-linux-x86-64.so.2 (0x00007fcf31615000) libblkid.so.1 => /lib64/libblkid.so.1 (0x00007fcf3121f000) libuuid.so.1 => /lib64/libuuid.so.1 (0x00007fcf31214000) libselinux.so.1 => /lib64/libselinux.so.1 (0x00007fcf311e7000) libpcre2-8.so.0 => /lib64/libpcre2-8.so.0 (0x00007fcf31162000) libdl.so.2 => /lib64/libdl.so.2 (0x00007fcf3115c000) build/libnss_systemd.so.2: linux-vdso.so.1 (0x00007ffda6d17000) librt.so.1 => /lib64/librt.so.1 (0x00007f610b83c000) libcap.so.2 => /lib64/libcap.so.2 (0x00007f610b835000) libmount.so.1 => /lib64/libmount.so.1 (0x00007f610b7d8000) libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f610b7b7000) libc.so.6 => /lib64/libc.so.6 (0x00007f610b5f1000) /lib64/ld-linux-x86-64.so.2 (0x00007f610b995000) libblkid.so.1 => /lib64/libblkid.so.1 (0x00007f610b59e000) libuuid.so.1 => /lib64/libuuid.so.1 (0x00007f610b593000) libselinux.so.1 => /lib64/libselinux.so.1 (0x00007f610b566000) libpcre2-8.so.0 => /lib64/libpcre2-8.so.0 (0x00007f610b4e1000) libdl.so.2 => /lib64/libdl.so.2 (0x00007f610b4db000) (after) build/libnss_myhostname.so.2: linux-vdso.so.1 (0x00007fff0b5e2000) librt.so.1 => /lib64/librt.so.1 (0x00007fde0c328000) libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fde0c307000) libc.so.6 => /lib64/libc.so.6 (0x00007fde0c141000) /lib64/ld-linux-x86-64.so.2 (0x00007fde0c435000) build/libnss_mymachines.so.2: linux-vdso.so.1 (0x00007ffdc30a7000) librt.so.1 => /lib64/librt.so.1 (0x00007f06ecabb000) libcap.so.2 => /lib64/libcap.so.2 (0x00007f06ecab4000) libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f06eca93000) libc.so.6 => /lib64/libc.so.6 (0x00007f06ec8cd000) /lib64/ld-linux-x86-64.so.2 (0x00007f06ecc15000) build/libnss_resolve.so.2: linux-vdso.so.1 (0x00007ffe95747000) librt.so.1 => /lib64/librt.so.1 (0x00007fa56a80f000) libcap.so.2 => /lib64/libcap.so.2 (0x00007fa56a808000) libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fa56a7e7000) libc.so.6 => /lib64/libc.so.6 (0x00007fa56a621000) /lib64/ld-linux-x86-64.so.2 (0x00007fa56a964000) build/libnss_systemd.so.2: linux-vdso.so.1 (0x00007ffe67b51000) librt.so.1 => /lib64/librt.so.1 (0x00007ffb32113000) libcap.so.2 => /lib64/libcap.so.2 (0x00007ffb3210c000) libpthread.so.0 => /lib64/libpthread.so.0 (0x00007ffb320eb000) libc.so.6 => /lib64/libc.so.6 (0x00007ffb31f25000) /lib64/ld-linux-x86-64.so.2 (0x00007ffb3226a000) I don't quite understand what is going on here, but let's not be too picky.	2018-11-29 21:03:44 +01:00
Zbigniew Jędrzejewski-Szmek	baaa35ad70	coccinelle: make use of SYNTHETIC_ERRNO Ideally, coccinelle would strip unnecessary braces too. But I do not see any option in coccinelle for this, so instead, I edited the patch text using search&replace to remove the braces. Unfortunately this is not fully automatic, in particular it didn't deal well with if-else-if-else blocks and ifdefs, so there is an increased likelihood of some bugs in such spots. I also removed part of the patch that coccinelle generated for udev, where we return -1 for failure. This should be fixed independently.	2018-11-22 10:54:38 +01:00
Yu Watanabe	93bab28895	tree-wide: use typesafe_qsort()	2018-09-19 08:02:52 +09:00
Yu Watanabe	2e4a4faea8	core/namespace: add more log messages	2018-09-18 14:31:09 +09:00
Alan Jenkins	fcac12d150	namespace: remove redundant .has_prefix=false The MountEntry's added for EMPTY_DIR work very similarly to the TMPFS ones. In both cases, .has_prefix is false. In fact, .has_prefix is false in all the MountEntry's we add except for the access mounts (READONLY etc). But EMPTY_DIR stuck out by explicitly setting .has_prefix = false. Let's remove that.	2018-09-01 17:23:01 +09:00
Alan Jenkins	4a756839e6	namespace: we always use a root_directory now We changed to always set up the new namespace in a separate directory (commit `0722b35`).	2018-09-01 17:23:01 +09:00
Alan Jenkins	ad8e66dcc4	namespace: fix mode for TemporaryFileSystem= ... when no mount options are passed. Change the code, to avoid the following failure in the newly added tests: exec-temporaryfilesystem-rw.service: Executing: /usr/bin/sh -x -c '[ "$(stat -c %a /var)" == 755 ]' ++ stat -c %a /var + '[' 1777 == 755 ']' Received SIGCHLD from PID 30364 (sh). Child 30364 (sh) died (code=exited, status=1/FAILURE) (And I spotted an opportunity to use TAKE_PTR() at the end).	2018-09-01 17:22:14 +09:00
Alan Jenkins	69338c3dfb	namespace: don't try to remount superblocks We can't remount the underlying superblocks, if we are inside a user namespace and running Linux <= 4.17. We can only change the per-mount flags (MS_REMOUNT | MS_BIND). This type of mount() call can only change the per-mount flags, so we don't have to worry about passing the right string options now. Fixes #9914 ("Since `1beab8b` was merged, systemd has been failing to start systemd-resolved inside unprivileged containers" ... "Failed to re-mount '/run/systemd/unit-root/dev' read-only: Operation not permitted"). > It's basically my fault :-). I pointed out we could remount read-only > without MS_BIND when reviewing the PR that added TemporaryFileSystem=, > and poettering suggested to change PrivateDevices= at the same time. > I think it's safe to change back, and I don't expect anyone will notice > a difference in behaviour. > > It just surprised me to realize that > `TemporaryFileSystem=/tmp:size=10M,ro,nosuid` would not apply `ro` to the > superblock (underlying filesystem), like mount -osize=10M,ro,nosuid does. > Maybe a comment could note the kernel version (v4.18), that lets you > remount without MS_BIND inside a user namespace. This makes the code longer and I guess this function is still ugly, sorry. One obstacle to cleaning it up is the interaction between `PrivateDevices=yes` and `ReadOnlyPaths=/dev`. I've added a test for the existing behaviour, which I think is now the correct behaviour.	2018-08-30 11:17:16 +01:00
Yu Watanabe	52e4d62550	Merge pull request #9852 from poettering/namespace-errno namespace: be more careful when handling namespacing failures	2018-08-22 11:16:29 +09:00
Lennart Poettering	1beab8b0d0	namespace: be more careful when handling namespacing failures gracefully This makes two changes to the namespacing code: 1. We'll only gracefully skip service namespacing on access failure if exclusively sandboxing options were selected, and not mount-related options that result in a very different view of the world. For example, ignoring RootDirectory=, RootImage= or Bind= is really problematic, but ReadOnlyPaths= is just a weaker sandbox. 2. The namespacing code will now return a clearly recognizable error code when it cannot enforce its namespacing, so that we cannot confuse EPERM errors from mount() with those from unshare(). Only the errors from the first unshare() are now taken as a hint to gracefully disable namespacing. Fixes: #9844 #9835	2018-08-21 20:00:33 +02:00
Zbigniew Jędrzejewski-Szmek	7692fed98b	Merge pull request #9783 from poettering/get-user-creds-flags beef up get_user_creds() a bit and other improvements	2018-08-21 10:09:33 +02:00
Lennart Poettering	b2a60844c4	namespace: when creating device nodes, also create /dev/char/* symlinks On the host these symlinks are created by udev, and we consider them API and make use of them ourselves at various places. Hence when running a private /dev, also create these symlinks so that lookups by major/minor work in such an environment, too.	2018-08-20 15:58:11 +02:00
Yu Watanabe	763a260ae7	core/namespace: add more log messages Suggested by #9835.	2018-08-10 14:30:35 +09:00
Yu Watanabe	839f187753	core/namespace: drop mount points outside of root even if RootDirectory= is not set	2018-08-06 12:51:33 +09:00
Yu Watanabe	9b68367b3a	core/namespace: drop conditions that depend on whether `root` is empty After `0722b35934`, the variable `root` is always set.	2018-08-06 12:51:33 +09:00
Chris Lamb	3fe910794b	Correct a number of trivial typos.	2018-06-18 22:44:44 +02:00
Yu Watanabe	1e8c7bd55c	namespace: drop protect_{home,system}_or_bool_from_string() The functions protect_{home,system}_from_string() are not used except for defining protect_{home,system}_or_bool_from_string(). This makes protect_{home,system}_from_string() support boolean strings, and drops protect_{home,system}_or_bool_from_string().	2018-06-15 11:32:27 +02:00
Zbigniew Jędrzejewski-Szmek	b0450864f1	Merge pull request #9274 from poettering/comment-header-cleanup drop "this file is part of systemd" and lennart's copyright from header	2018-06-14 11:26:50 +02:00
Jan Synacek	0722b35934	namespace: always use a root directory when setting up namespace 1) mv /var/tmp /var/tmp.old 2) mkdir /tmp/varrr 3) ln -s /tmp/varrr /var/tmp Now, when a service has PrivateTmp=yes, during namespace setup, /tmp is first mounted over with a new mount. Then, when /var/tmp is being resolved, it points to /tmp/varrr, which by then doesn't exist, because it had already been obscured.	2018-06-14 10:25:16 +02:00
Lennart Poettering	0c69794138	tree-wide: remove Lennart's copyright lines These lines are generally out-of-date, incomplete and unnecessary. With SPDX and git repository much more accurate and fine grained information about licensing and authorship is available, hence let's drop the per-file copyright notice. Of course, removing copyright lines of others is problematic, hence this commit only removes my own lines and leaves all others untouched. It might be nicer if sooner or later those could go away too, making git the only and accurate source of authorship information.	2018-06-14 10:20:20 +02:00
Lennart Poettering	818bf54632	tree-wide: drop 'This file is part of systemd' blurb This part of the copyright blurb stems from the GPL use recommendations: https://www.gnu.org/licenses/gpl-howto.en.html The concept appears to originate in times where version control was per file, instead of per tree, and was a way to glue the files together. Ultimately, we nowadays don't live in that world anymore, and this information is entirely useless anyway, as people are very welcome to copy these files into any projects they like, and they shouldn't have to change bits that are part of our copyright header for that. hence, let's just get rid of this old cruft, and shorten our codebase a bit.	2018-06-14 10:20:20 +02:00
Zbigniew Jędrzejewski-Szmek	5d904a6aaa	tree-wide: drop !! casts to booleans They are not needed, because anything that is non-zero is converted to true. C11: > 6.3.1.2: When any scalar value is converted to _Bool, the result is 0 if the > value compares equal to 0; otherwise, the result is 1. https://stackoverflow.com/questions/31551888/casting-int-to-bool-in-c-c	2018-06-13 10:52:40 +02:00
Lennart Poettering	228af36fff	core: add new PrivateMounts= unit setting This new setting is supposed to be useful in most cases where "MountFlags=slave" is currently used, i.e. as an explicit way to run a service in its own mount namespace and decouple propagation from all mounts of the new mount namespace towards the host. The effect of MountFlags=slave and PrivateMounts=yes is mostly the same, as both cause a CLONE_NEWNS namespace to be opened, and both will result in all mounts within it to be mounted MS_SLAVE. The difference is mostly on the conceptual/philosophical level: configuring the propagation mode is nothing people should have to think about, in particular as the matter is not precisely easy to grok. Moreover, MountFlags= allows configuration of "private" and "shared" modes which don't really make much sense to use in real-life and are quite confusing. In particular MountFlags=private means mounts made on the host stay pinned for good by the service which is particularly nasty for removable media mounts. And MountFlags=shared is in most ways a NOP when used alone... The main technical difference between setting only MountFlags=slave or only PrivateMounts=yes in a unit file is that the former remounts all mounts to MS_SLAVE and leaves them there, while the latter remounts them to MS_SHARED again right after. The latter is generally a nicer approach, since it disables propagation, while MS_SHARED is afterwards in effect, which is really nice as that means further namespacing down the tree will get MS_SHARED logic by default and we unify how applications see our mounts as we always pass them as MS_SHARED regardless whether any mount namespacing is used or not. The effect of PrivateMounts=yes was implied already by all the other mount namespacing options. With this new option we add an explicit knob for it, to request it without any other option used as well. See: #4393	2018-06-12 16:12:10 +02:00
Yu Watanabe	fa65c28176	namespace: rename parse_protect_{home,system}_or_bool() to protect_{home,system}_or_bool_from_string() Hence, we can define config_parse_protect_{home,system}() by using DEFINE_CONFIG_PARSE_ENUM() macro.	2018-05-31 11:09:41 +09:00
Lennart Poettering	4e2c0a227e	namespace: extend list of masked files by ProtectKernelTunables= This adds a number of entries nspawn already applies to regular service namespacing too. Most importantly let's mask /proc/kcore and /proc/kallsyms too.	2018-05-03 17:46:31 +02:00
Lennart Poettering	da6053d0a7	tree-wide: be more careful with the type of array sizes Previously we were a bit sloppy with the index and size types of arrays, we'd regularly use unsigned. While I don't think this ever resulted in real issues I think we should be more careful there and follow a stricter regime: unless there's a strong reason not to use size_t for array sizes and indexes, size_t it should be. Any allocations we do ultimately will use size_t anyway, and converting forth and back between unsigned and size_t will always be a source of problems. Note that on 32bit machines "unsigned" and "size_t" are equivalent, and on 64bit machines our arrays shouldn't grow that large anyway, and if they do we have a problem, however that kind of overly large allocation we have protections for usually, but for overflows we do not have that so much, hence let's add it. So yeah, it's a story of the current code being already "good enough", but I think some extra type hygiene is better. This patch tries to be comprehensive, but it probably isn't and I missed a few cases. But I guess we can cover that later as we notice it. Among smaller fixes, this changes: 1. strv_length()'s return type becomes size_t 2. the unit file changes array size becomes size_t 3. DNS answer and query array sizes become size_t Fixes: https://bugs.freedesktop.org/show_bug.cgi?id=76745	2018-04-27 14:29:06 +02:00
Lennart Poettering	088696fe29	namespace: rework how we resolve symlinks in mount points Before this patch we'd resolve all symlinks of bind mounts and other mount points to establish for a service in advance, and only then start mounting them. This is problematic, if symlink chains jump around between directories in a namespace tree, so that to resolve a specific symlink chain we need to establish another mount already. A typical case where this happens is if /etc/resolv.conf is a symlink to some file in /run: in that case we'd normally resolve and mount /etc/resolv.conf early on, but that's broken, as to do this properly we'd need to resolve /etc/resolv.conf first, then figure out that /run needs to be mounted before we can proceed, and thus reorder the order in which we apply mounts dynamically. With this change, whenever we are about to apply a mount, we'll do a single step of the symlink normalization process, patch the mount entry accordingly, and then sort the list of mounts to establish again, taking the new path into account. This means that we can correctly deal with the example above: we might start with wanting to mount /etc/resolv.conf early, but after resolving it to the path in /run/ we'd push it to the end of the list, ensuring that /run is mounted first. (Note that this also fixes another bug: we were following symlinks on the bind mount source relative to the root directory of the service, rather than of the host. That's wrong though as we explicitly document that the source of bind mounts is always on the host.)	2018-04-18 14:17:50 +02:00
Lennart Poettering	e871786273	namespace: improve logging when creating mount source nodes	2018-04-18 14:15:48 +02:00
Lennart Poettering	f8b64b5723	namespace: split out calls to normalize mount entry list into new function	2018-04-18 14:15:48 +02:00
Lennart Poettering	c9ef8573be	namespace: don't consider raw image read-only if /home in it is writable	2018-04-18 14:15:48 +02:00
Lennart Poettering	12777909c9	Merge pull request #8417 from brauner/2018-03-09/add_bind_mount_fallback_to_private_devices core: fall back to bind-mounts for PrivateDevices= execution environments	2018-04-18 11:56:56 +02:00
Zbigniew Jędrzejewski-Szmek	af984e137e	core/namespace: rework the return semantics of clone_device_node yet again Returning 0 on not-found/wrong-type is confusing. Let's return -ENXIO in that case instead, and explicitly ignore it in the call site where we want to do that. I think this is clearer and less likely to be used erroneously in case another call site is added. C.f. `152c475f95` and `98b1d2b8d9`.	2018-04-12 18:15:33 +02:00
Christian Brauner	1649861744	core: fall back to bind-mounts for PrivateDevices= execution environments In environments where CAP_MKNOD is not available or inside user namespaces it is still desirable to enable services to use PrivateDevices= . So fall back to using bind-mounts on EPERM.	2018-04-12 18:15:12 +02:00
Zbigniew Jędrzejewski-Szmek	11a1589223	tree-wide: drop license boilerplate Files which are installed as-is (any .service and other unit files, .conf files, .policy files, etc), are left as is. My assumption is that SPDX identifiers are not yet that well known, so it's better to retain the extended header to avoid any doubt. I also kept any copyright lines. We can probably remove them, but it'd be nice to obtain explicit acks from all involved authors before doing that.	2018-04-06 18:58:55 +02:00
Yu Watanabe	1cc6c93a95	tree-wide: use TAKE_PTR() and TAKE_FD() macros	2018-04-05 14:26:26 +09:00
Lennart Poettering	62570f6f03	fs-util: add new CHASE_TRAIL_SLASH flag for chase_symlinks() This rearranges chase_symlinks() a bit: if no special flags are specified it will now revert to behaviour before `b12d25a8d6`. However, if the new CHASE_TRAIL_SLASH flag is specified it will follow the behaviour introduced by that commit. I wasn't sure which one to make the behaviour that requires specification of a flag to enable. I opted to make the "append trailing slash" behaviour the one to enable by a flag, following the thinking that the function should primarily be used to generate a normalized path, and I am pretty sure a path without trailing slash is the more "normalized" one, as the trailing slash is not really a part of it, but merely a "decorator" that tells various system calls to generate ENOTDIR if the path doesn't refer to a directory. Or to say this differently: if the slash was part of normalization then we really should add it in all cases when the final path is a directory, not just when the user originally specified it. Fixes: #8544 Replaces: #8545	2018-03-22 19:54:24 +01:00
Zbigniew Jędrzejewski-Szmek	671f0f8de0	Remove /sbin from paths if split-bin is false (#8324) Follow-up for `157baa87e4`.	2018-03-01 21:48:36 +01:00
Ansgar Burchardt	7486f305cd	Include additional directories in ProtectSystem	2018-02-27 18:56:19 -03:00
Zbigniew Jędrzejewski-Szmek	aa484f3561	tree-wide: use reallocarray instead of our home-grown realloc_multiply (#8279) There isn't much difference, but in general we prefer to use the standard functions. glibc provides reallocarray since version 2.26. I moved the explicit_bzero configure test to the bottom, so that the two stdlib functions are at the bottom.	2018-02-26 21:20:00 +01:00
Lennart Poettering	13a141f046	namespace: protect bpf file system as part of ProtectKernelTunables= It also exposes kernel objects, let's better include this in ProtectKernelTunables=.	2018-02-21 16:43:36 +01:00
Yu Watanabe	e4da7d8c79	core: add new option 'tmpfs' to ProtectHome= This makes the ProtectHome= setting accept 'tmpfs'. This is mostly equivalent to `TemporaryFileSystem=/home /run/user /root`.	2018-02-21 09:18:17 +09:00
Yu Watanabe	2abd4e388a	core: add new setting TemporaryFileSystem= This introduces a new setting TemporaryFileSystem=. This is useful to hide files not relevant to the processes invoked by unit, while necessary files or directories can be still accessed by combining with Bind{,ReadOnly}Paths=.	2018-02-21 09:17:52 +09:00
Yu Watanabe	4ca763a902	core/namespace: make '-' prefix in Bind{,ReadOnly}Paths= work Each path in `Bind{,ReadOnly}Paths=` accepts a '-' prefix. However, the prefix was completely ignored. This makes it work as expected.	2018-02-21 09:07:56 +09:00
Yu Watanabe	f5c52a7724	core/namespace: remove unused argument	2018-02-21 09:05:30 +09:00
Yu Watanabe	e282f51f57	core/namespace: use free_and_replace()	2018-02-21 09:05:21 +09:00
Yu Watanabe	55fe743273	core/namespace: fix comment	2018-02-21 09:05:18 +09:00
Yu Watanabe	89bd586cd3	core/namespace: merge PRIVATE_VAR_TMP into PRIVATE_TMP	2018-02-21 09:05:16 +09:00
Yu Watanabe	2a2969fd5d	core/namespace: make arguments const if possible	2018-02-21 09:05:14 +09:00
Zbigniew Jędrzejewski-Szmek	f863b1c6fa	core: move very long argument to a separate statement I like compact, but this was a bit too much.	2018-02-15 10:10:01 +01:00
Lennart Poettering	152c475f95	namespace: fix error handling when clone_device_node() returns 0 Before this patch, we'd treat clone_device_node() returning 0 (as opposed to 1) as error, but then propagate this non-error result in confusion. This makes sure that if ptmx isn't around we propagate that as -ENXIO. This is a follow-up for `98b1d2b8d9`	2018-01-23 19:50:32 +01:00
Lennart Poettering	36ce7110b0	namespace: use is_symlink() helper We have this pretty little helper, let's use it, it makes things a tiny bit more readable.	2018-01-23 19:36:55 +01:00
Lennart Poettering	6f7f3a3351	namespace: use stack allocation for paths, where we can	2018-01-23 19:36:36 +01:00
Alan Jenkins	68f7480b7e	Merge pull request #7913 from sourcejedi/devpts 3 nitpicks from core/namespace.c	2018-01-18 21:56:26 +00:00
Alan Jenkins	225874dc9c	core: clone_device_node(): add debug message For people who use debug messages, maybe it is helpful to know that PrivateDevices= failed due to mknod(), and which device node. (The other (un-logged) failures could be while mounting filesystems e.g. no CAP_SYS_ADMIN which is the common case, or missing /dev/shm or /dev/pts, or missing /dev/ptmx).	2018-01-18 13:58:13 +00:00
Alan Jenkins	8d95368210	core: namespace: remove unnecessary mode on /dev/shm mount target This should have no behavioural effect; it just confused me. All the other mount directories in this function are created as 0755. Some of the mounts are allowed to fail - mqueue and hugepages. If the /dev/mqueue mount target was created with the permissive mode 01777, to match the filesystem we're trying to mount there, then a mount failure would allow unprivileged users to write to the /dev filesystem, e.g. to exhaust the available space. There is no reason to allow this. (Allowing the user read access (0755) seems a reasonable idea though, e.g. for quicker troubleshooting.) We do not allow failure of the /dev/shm mount, so it doesn't matter that it is created as 01777. But on the same grounds, we have no reason to create it as any specific mode. 0755 is equally fine. This function will be clearer by using 0755 throughout, to avoid unintentionally implying some connection between the mode of the mount target, and the mode of the mounted filesystem.	2018-01-17 18:04:34 +00:00
Alan Jenkins	98b1d2b8d9	core: namespace: nitpick /dev/ptmx error handling If /dev/tty did not exist, or had st_rdev == 0, we ignored it. And the same is true for null, zero, full, random, urandom. If /dev/ptmx did not exist, we treated this as a failure. If /dev/ptmx had st_rdev == 0, we ignored it. This was a very recent change, but there was no reason for ptmx creation specifically to treat st_rdev == 0 differently from non-existence. This confuses me when reading it. Change the creation of /dev/ptmx so that st_rdev == 0 is treated as failure. This still leaves /dev/ptmx as a special case with stricter handling. However it is consistent with the immediately preceding creation of /dev/pts/, which is treated as essential, and is directly related to ptmx. I don't know why we check st_rdev. But I'd prefer to have only one unanswered question here, and not to have a second unanswered question added on top.	2018-01-17 13:28:32 +00:00
Дамјан Георгиевски	414b304ba2	namespace: only make the symlink /dev/ptmx if it was already a symlink …otherwise try to clone it as a device node On most contemporary distros /dev/ptmx is a device node, and /dev/pts/ptmx has 000 inaccessible permissions. In those cases the symlink /dev/ptmx -> /dev/pts/ptmx breaks the pseudo tty support. In that case we better clone the device node. OTOH, in nspawn containers (and possibly others), /dev/pts/ptmx has normal permissions, and /dev/ptmx is a symlink. In that case make the same symlink. fixes #7878	2018-01-17 01:19:46 +01:00
Дамјан Георгиевски	b5e99f23ed	namespace: extract clone_device_node function from mount_private_dev	2018-01-16 21:41:10 +01:00
Yu Watanabe	03c791aa24	namespace: introduce parse_protect_system_or_bool()	2018-01-02 02:23:13 +09:00
Yu Watanabe	5e1c61544c	namespace: introduce parse_protect_home_or_bool()	2018-01-02 02:23:05 +09:00
Lennart Poettering	2d3a5a73e0	nspawn: make sure images containing an ESP are compatible with userns -U mode In -U mode we might need to re-chown() all files and directories to match the UID shift we want for the image. That's problematic on fat partitions, such as the ESP (and which is generated by mkosi's --bootable switch), because fat of course knows no UID/GID file ownership natively. With this change we take benefit of the uid= and gid= mount options FAT knows: instead of chown()ing all files and directories we can just specify the right UID/GID to use at mount time. This beefs up the image dissection logic in two ways: 1. First of all support for mounting relevant file systems with uid=/gid= is added: when a UID is specified during mount it is used for all applicable file systems. 2. Secondly, two new mount flags are added: DISSECT_IMAGE_MOUNT_ROOT_ONLY and DISSECT_IMAGE_MOUNT_NON_ROOT_ONLY. If one is specified the mount routine will either only mount the root partition of an image, or all partitions except the root partition. This is used by nspawn: first the root partition is mounted, so that we can determine the UID shift in use so far, based on ownership of the image's root directory. Then, we mount the remaining partitions in a second go, this time with the right UID/GID information.	2017-12-05 13:49:12 +01:00
Shawn Landden	4831981d89	tree-wide: adjust fall through comments so that gcc is happy Distcc removes comments, making the comment silencing not work. I know there was a decision against a macro in commit `ec251fe7d5`	2017-11-20 13:06:25 -08:00
Zbigniew Jędrzejewski-Szmek	53e1b68390	Add SPDX license identifiers to source files under the LGPL This follows what the kernel is doing, c.f. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5fd54ace4721fc5ce2bb5aef6318fcf17f421460.	2017-11-19 19:08:15 +01:00
Lennart Poettering	4e0c20de97	namespace: set up OS hierarchy only after mounting the new root, not before Otherwise it's a pointless exercise, as we'll set up an empty directory tree that's never going to be used. Hence, let's move this around a bit, so that we do the basesystem initialization exactly when RootImage= or RootDirectory= are used, but not otherwise.	2017-11-13 10:22:36 +01:00
Yu Watanabe	d18aff0422	core: ReadWritePaths= and friends assume '+' prefix when BindPaths= or friends are set When at least one of BindPaths=, BindReadOnlyPaths=, RootImage=, RuntimeDirectory= or their friends are set, systemd prepares a namespace under /run/systemd/unit-root. Thus, ReadWritePaths= or their friends without '+' prefix is completely meaningless. So, let's assume '+' prefix when one of them is set. Fixes #7070 and #7080.	2017-11-08 15:48:01 +09:00
Lennart Poettering	0fa5b8312a	namespace: make ns_type_supported() a tiny bit shorter namespace_type_to_string() already validates the type parameter, we can use that, and shorten the function a bit.	2017-10-10 09:52:08 +02:00
Lennart Poettering	bb0ff3fb1b	namespace: change NameSpace → Namespace We generally use the casing "Namespace" for the word, and that's visible in a number of user-facing interfaces, including "RestrictNamespace=" or "JoinsNamespaceOf=". Let's make sure to use the same casing internally too. As discussed in #7024	2017-10-10 09:51:58 +02:00
Michal Sekletar	6e2d7c4f13	namespace: fall back gracefully when kernel doesn't support network namespaces (#7024)	2017-10-10 09:46:13 +02:00
Zbigniew Jędrzejewski-Szmek	349cc4a507	build-sys: use #if Y instead of #ifdef Y everywhere The advantage is that if the name is misspelt, cpp will warn us. $ git grep -Ee "conf.set$'(HAVE\|ENABLE)_" -l\|xargs sed -r -i "s/conf.set\('(HAVE\|ENABLE)_/conf.set10('\1_/" $ git grep -Ee '#ifn?def (HAVE\|ENABLE)' -l\|xargs sed -r -i 's/#ifdef (HAVE\|ENABLE)/#if \1/; s/#ifndef (HAVE\|ENABLE)/#if ! \1/;' $ git grep -Ee 'if.defined\(HAVE' -l\|xargs sed -i -r 's/defined\((HAVE_[A-Z0-9_])$/\1/g' $ git grep -Ee 'if.defined$ENABLE' -l\|xargs sed -i -r 's/defined\((ENABLE_[A-Z0-9_])$/\1/g' + manual changes to meson.build squash! build-sys: use #if Y instead of #ifdef Y everywhere v2: - fix incorrect setting of HAVE_LIBIDN2	2017-10-04 12:09:29 +02:00
Lennart Poettering	6c47cd7d3b	execute: make StateDirectory= and friends compatible with DynamicUser=1 and RootDirectory=/RootImage= Let's clean up the interaction of StateDirectory= (and friends) to DynamicUser=1: instead of creating these directories directly below /var/lib, place them in /var/lib/private instead if DynamicUser=1 is set, making that directory 0700 and owned by root:root. This way, if a dynamic UID is later reused, access to the old run's state directory is prohibited for that user. Then, use file system namespacing inside the service to make /var/lib/private a readable tmpfs, hiding all state directories that are not listed in StateDirectory=, and making access to the actual state directory possible. Mount all directories listed in StateDirectory= to the same places inside the service (which means they'll now be mounted into the tmpfs instance). Finally, add a symlink from the state directory name in /var/lib/ to the one in /var/lib/private, so that both the host and the service can access the path under the same location. Here's an example: let's say a service runs with StateDirectory=foo. When DynamicUser=0 is set, it will get the following setup, and no difference between what the unit and what the host sees: /var/lib/foo (created as directory) Now, if DynamicUser=1 is set, we'll instead get this on the host: /var/lib/private (created as directory with mode 0700, root:root) /var/lib/private/foo (created as directory) /var/lib/foo → private/foo (created as symlink) And from inside the unit: /var/lib/private (a tmpfs mount with mode 0755, root:root) /var/lib/private/foo (bind mounted from the host) /var/lib/foo → private/foo (the same symlink as above) This takes inspiration from how container trees are protected below /var/lib/machines: they generally reuse UIDs/GIDs of the host, but because /var/lib/machines itself is set to 0700 host users cannot access files in the container tree even if the UIDs/GIDs are reused. 
However, for this commit we add one further trick: inside and outside of the unit /var/lib/private is a different thing: outside it is a plain, inaccessible directory, and inside it is a world-readable tmpfs mount with only the whitelisted subdirs below it, bind mounted in. This means, from the outside the dir acts as an access barrier, but from the inside it does not. And the symlink created in /var/lib/foo itself points across the barrier in both cases, so that root and the unit's user always have access to these dirs without knowing the details of this mounting magic. This logic resolves a major shortcoming of DynamicUser=1 units: previously they couldn't safely store persistent data. With this change they can have their own private state, log and data directories, which they can write to, but which are protected from UID recycling. With this change, if RootDirectory= or RootImage= are used it is ensured that the specified state/log/cache directories are always mounted in from the host. This change of semantics I think is much preferable since this means the root directory/image logic can be used easily for read-only resource bundling (as all writable data resides outside of the image). Note that this is a change of behaviour, but given that we haven't released any systemd version with StateDirectory= and friends implemented this should be a safe change to make (in particular as previously it wasn't clear what would actually happen when used in combination). Moreover, by making this change we can later add a "+" modifier to these settings too working similar to the same modifier in ReadOnlyPaths= and friends, making specified paths relative to the container itself.	2017-10-02 17:41:44 +02:00
Lennart Poettering	a227a4be48	namespace: if we can create the destination of bind and PrivateTmp= mounts When putting together the namespace, always create the file or directory we are supposed to bind mount on, the same way we do it for most other stuff, for example mount units or systemd-nspawn's --bind= option. This has the big benefit that we can use namespace bind mounts on dirs in /tmp or /var/tmp even in conjunction with PrivateTmp=.	2017-10-02 17:41:43 +02:00
Lennart Poettering	e908468b5b	namespace: properly handle bind mounts from the host Before this patch we had an ordering problem: if we have no namespacing enabled except for two bind mounts that intend to swap /a and /b via bind mounts, then we'd execute the bind mount binding /b to /a, followed by the bind mount from /a to /b, thus having the effect that /b is now visible in both /a and /b, which was not intended. With this change, as soon as any bind mount is configured we'll put together the service mount namespace in a temporary directory instead of operating directly in the root. This solves the problem in a straightforward fashion: the source of bind mounts will always refer to the host, and thus be unaffected by the bind mounts we already created.	2017-10-02 17:41:43 +02:00
Lennart Poettering	645767d6b5	namespace: create /dev, /proc, /sys when needed We already create /dev implicitly if PrivateTmp=yes is on, if it is missing. Do so too for the other two API VFS, as well as for /dev if PrivateTmp=yes is off but MountAPIVFS=yes is on (i.e. when /dev is bind mounted from the host).	2017-10-02 17:41:43 +02:00
Topi Miettinen	07ce74074d	namespace: avoid assertion failure (#6649) If the root image is not decrypted, it must not be relinquished.	2017-08-29 17:31:24 +02:00
Nicolas Iooss	3a0bf6d6aa	namespace: keep selinuxfs mounted read-write with ProtectKernelTunables (#5741) When a service unit uses "ProtectKernelTunables=yes", it currently remounts /sys/fs/selinux read-only. This makes libselinux report SELinux state as "disabled", because most SELinux features are not usable. For example it is not possible to validate security contexts (with security_check_context_raw() or /sys/fs/selinux/context). This behavior of libselinux has been described in http://danwalsh.livejournal.com/73099.html and confirmed in a recent email, https://marc.info/?l=selinux&m=149220233032594&w=2 . Since commit `0c28d51ac8` ("units: further lock down our long-running services"), systemd-localed unit uses ProtectKernelTunables=yes. Nevertheless this service needs to use libselinux API in order to create /etc/vconsole.conf, /etc/locale.conf... with the right SELinux contexts. This is broken when /sys/fs/selinux is mounted read-only in the mount namespace of the service. Make SELinux-aware systemd services work again when they are using ProtectKernelTunables=yes by keeping selinuxfs mounted read-write.	2017-07-31 17:45:33 +02:00

274 commits