Commit graph

137 commits

Author SHA1 Message Date
Yu Watanabe db9ecf0501 license: LGPL-2.1+ -> LGPL-2.1-or-later 2020-11-09 13:23:58 +09:00
Lennart Poettering 30f5d10421 mount-util: rework umount_verbose() to take log level and flags arg
Let's make umount_verbose() more like mount_verbose_xyz(), i.e. take log
level and flags param. In particular the latter matters, since we
typically don't actually want to follow symlinks when unmounting.
2020-09-23 18:57:36 +02:00
Lennart Poettering 511a8cfe30 mount-util: switch most mount_verbose() code over to not follow symlinks 2020-09-23 18:57:36 +02:00
Zbigniew Jędrzejewski-Szmek 90e30d767a Rename strv_split_extract() to strv_split_full()
Now that _full() is gone, we can rename _extract() to have the usual suffix
we use for the more featureful version.
2020-09-09 09:34:55 +02:00
Zbigniew Jędrzejewski-Szmek b67ec8e5b2 pid1: stop limiting size of /dev/shm
The explicit limit is dropped, which means that we return to the kernel default
of 50% of RAM. See 362a55fc14 for a discussion why that is not as much as it
seems. It turns out various applications need more space in /dev/shm and we
would break them by imposing a low limit.

While at it, rename the define and use a single macro for various tmpfs mounts.
We don't really care what the purpose of the given tmpfs is, so it seems
reasonable to use a single macro.

This effectively reverts part of 7d85383edb. Fixes #16617.
2020-07-30 18:48:35 +02:00
Lennart Poettering d64e32c245 nspawn: rework how /run/host/ is set up
Let's find the right os-release file on the host side, and only mount
the one that matters, i.e. /etc/os-release if it exists and
/usr/lib/os-release otherwise. Use the fixed path /run/host/os-release
for that.

Let's also mount /run/host as a bind mount on itself before we set up
/run/host, and let's mount it MS_RDONLY after we are done, so that it
remains immutable as a whole.
2020-07-23 18:47:38 +02:00
Luca Boccassi 14f1c47a0c nspawn: mount os-release in two steps to make it read-only
The kernel interface requires setting up read-only bind-mounts in
two steps, the bind first and then a read-only remount.
Fix nspawn-mount, and cover this case in the integration test.

Fixes #16484
2020-07-16 09:59:59 +01:00
Luca Boccassi eafc7d6056 nspawn: use access/F_OK instead of stat to check for file existence 2020-07-16 09:59:59 +01:00
Luca Boccassi e1bb4b0d1d nspawn: implement container host os-release interface 2020-06-23 12:58:21 +01:00
Luca Boccassi b3b1a08a56 nspawn: use mkdir_p_safe instead of homegrown version 2020-06-23 12:57:05 +01:00
Lennart Poettering 6fe01ced0e nspawn: mkdir selinux mount point once, but not twice
Since #15533 we didn't create the mount point for selinuxfs anymore.

Before it we created it twice because we mount selinuxfs twice: once the
superblock, and once we remount its bind mound read-only. The second
mkdir would mean we'd chown() the host version of selinuxfs (since
there's only one selinuxfs superblock kernel-wide).

The right time to create mount point point is once: before we mount the
selinuxfs. But not a second time for the remount.

Fixes: #16032
2020-06-23 10:17:36 +02:00
Lennart Poettering 48b747fa03 inaccessible: move inaccessible file nodes to /systemd/ subdir in runtime dir always
Let's make sure $XDG_RUNTIME_DIR for the user instance and /run for the
system instance is always organized the same way: the "inaccessible"
device nodes should be placed in a subdir of either called "systemd" and
a subdir of that called "inaccessible".

This way we can emphasize the common behaviour, and only differ where
really necessary.

Follow-up for #13823
2020-06-09 16:23:56 +02:00
Topi Miettinen 7d85383edb tree-wide: add size limits for tmpfs mounts
Limit size of various tmpfs mounts to 10% of RAM, except volatile root and /var
to 25%. Another exception is made for /dev (also /devs for PrivateDevices) and
/sys/fs/cgroup since no (or very few) regular files are expected to be used.

In addition, since directories, symbolic links, device specials and xattrs are
not counted towards the size= limit, number of inodes is also limited
correspondingly: 4MB size translates to 1k of inodes (assuming 4k each), 10% of
RAM (using 16GB of RAM as baseline) translates to 400k and 25% to 1M inodes.

Because nr_inodes option can't use ratios like size option, there's an
unfortunate side effect that with small memory systems the limit may be on the
too large side. Also, on an extremely small device with only 256MB of RAM, 10%
of RAM for /run may not be enough for re-exec of PID1 because 16MB of free
space is required.
2020-05-13 00:37:18 +02:00
Lennart Poettering dcff2fa5d1 nspawn: be more careful with creating/chowning directories to overmount
We should never re-chown selinuxfs.

Fixes: #15475
2020-04-28 19:40:46 +02:00
Yu Watanabe 9610210d32 nspawn: voidify umount_verbose()
Fixes CID#1415122.
2020-01-31 23:10:29 +09:00
Daan De Meyer bbd407ea2b nspawn: Don't mount read-only if we have a custom mount on root. 2020-01-03 14:06:38 +01:00
Anita Zhang e5f10cafe0 core: create inaccessible nodes for users when making runtime dirs
To support ProtectHome=y in a user namespace (which mounts the inaccessible
nodes), the nodes need to be accessible by the user. Create these paths and
devices in the user runtime directory so they can be used later if needed.
2019-12-18 11:09:30 -08:00
Lennart Poettering d0556c55e7 nspawn: fix overlay with automatic temporary tree
This makes --overlay=+/foobar::/foobar work again, i.e. where the middle
parameter is left out. According to the documentation this is supposed
to generate a temporary writable work place in the midle. But it
apparently never did. Weird.
2019-12-13 15:11:38 +01:00
Daan De Meyer bd6609eb11 nspawn-mount: Use FLAGS_SET to check flags. 2019-12-12 20:18:37 +01:00
Daan De Meyer e091a5dfd1 nspawn-mount: Remove unused parameters 2019-12-12 20:15:10 +01:00
Daan De Meyer 5f0a6347ac nspawn: Enable specifying root as the mount target directory.
Fixes #3847.
2019-12-12 20:15:03 +01:00
Zbigniew Jędrzejewski-Szmek a5648b8094 basic/fs-util: change CHASE_OPEN flag into a separate output parameter
chase_symlinks() would return negative on error, and either a non-negative status
or a non-negative fd when CHASE_OPEN was given. This made the interface quite
complicated, because dependning on the flags used, we would get two different
"types" of return object. Coverity was always confused by this, and flagged
every use of chase_symlinks() without CHASE_OPEN as a resource leak (because it
would this that an fd is returned). This patch uses a saparate output parameter,
so there is no confusion.

(I think it is OK to have functions which return either an error or an fd. It's
only returning *either* an fd or a non-fd that is confusing.)
2019-10-24 22:44:24 +09:00
Frantisek Sumsal 38288f0bb8 tree-wide: various code-formatting improvements
Reported/found by Coccinelle
2019-09-22 07:17:27 +02:00
Lennart Poettering 07b9f3f03c nspawn: print an explanatory error when people try to use --volatile=yes on distros that are not /usr-merged 2019-07-29 11:30:47 +02:00
Iago López Galeiras a11fd4067b Revert "nspawn: remove unnecessary mount option parsing logic"
This reverts commit 72d967df3e.

Revert this because it broke the `norbind` option of the bind flags
because it does bind-mounts unconditionally recursive.

Let's bring the old logic back.

Fixes: #13170
2019-07-24 17:17:42 +02:00
Lennart Poettering cee97d5768
Merge pull request #12836 from yuwata/tree-wide-replace-strjoin
tree-wide: replace strjoin() with path_join()
2019-06-22 20:02:46 +02:00
Lennart Poettering c6134d3e2f path-util: get rid of prefix_root()
prefix_root() is equivalent to path_join() in almost all ways, hence
let's remove it.

There are subtle differences though: prefix_root() will try shorten
multiple "/" before and after the prefix. path_join() doesn't do that.
This means prefix_root() might return a string shorter than both its
inputs combined, while path_join() never does that. I like the
path_join() semantics better, hence I think dropping prefix_root() is
totally OK. In the end the strings generated by both functon should
always be identical in terms of path_equal() if not streq().

This leaves prefix_roota() in place. Ideally we'd have path_joina(), but
I don't think we can reasonably implement that as a macro. or maybe we
can? (if so, sounds like something for a later PR)

Also add in a few missing OOM checks
2019-06-21 08:42:55 +09:00
Yu Watanabe 657ee2d82b tree-wide: replace strjoin() with path_join() 2019-06-21 03:26:16 +09:00
Zbigniew Jędrzejewski-Szmek ca78ad1de9 headers: remove unneeded includes from util.h
This means we need to include many more headers in various files that simply
included util.h before, but it seems cleaner to do it this way.
2019-03-27 11:53:12 +01:00
Zbigniew Jędrzejewski-Szmek e1af3bc62a
Merge pull request #12106 from poettering/nosuidns
add "nosuid" flag to exec directory mounts of DynamicUser=1 services
2019-03-26 08:58:00 +01:00
Lennart Poettering 849b9b85b8 nspawn: mount mqueue with nodev,noexec,nosuid, too
The host mounts it like that, nspawn hence should do too.

Moreover, mount the file system after doing CLONEW_NEWIPC so that it
actually reflects the right mqueues. Finally, mount it wthout
considering it fatal, since POSIX mqueue support is little used and it
should be fine not to support it in the kernel.
2019-03-25 19:53:05 +01:00
Lennart Poettering 64e82c1976 mount-util: beef up bind_remount_recursive() to be able to toggle more than MS_RDONLY
The function is otherwise generic enough to toggle other bind mount
flags beyond MS_RDONLY (for example: MS_NOSUID or MS_NODEV), hence let's
beef it up slightly to support that too.
2019-03-25 19:33:55 +01:00
Lennart Poettering 2c9b7a7e62 mount: when we fail to establish an inaccessible mount gracefully, undo the mount 2019-03-21 12:41:02 +01:00
Zbigniew Jędrzejewski-Szmek d0b6a10c00
Merge pull request #9762 from poettering/nspawn-oci
OCI runtime support for nspawn
2019-03-21 11:01:53 +01:00
Yu Watanabe 1d0c1146ea nspawn: fix memleak
Fixes oss-fuzz#13691.
2019-03-15 23:53:05 +09:00
Lennart Poettering de40a3037a nspawn: add support for executing OCI runtime bundles with nspawn
This is a pretty large patch, and adds support for OCI runtime bundles
to nspawn. A new switch --oci-bundle= is added that takes a path to an
OCI bundle. The JSON file included therein is read similar to a .nspawn
settings files, however with a different feature set.

Implementation-wise this mostly extends the pre-existing Settings object
to carry additional properties for OCI. However, OCI supports some
concepts .nspawn files did not support yet, which this patch also adds:

1. Support for "masking" files and directories. This functionatly is now
   also available via the new --inaccesible= cmdline command, and
   Inaccessible= in .nspawn files.

2. Support for mounting arbitrary file systems. (not exposed through
   nspawn cmdline nor .nspawn files, because probably not a good idea)

3. Ability to configure the console settings for a container. This
   functionality is now also available on the nspawn cmdline in the new
   --console= switch (not added to .nspawn for now, as it is something
   specific to the invocation really, not a property of the container)

4. Console width/height configuration. Not exposed through
   .nspawn/cmdline, but this may be controlled through $COLUMNS and
   $LINES like in most other UNIX tools.

5. UID/GID configuration by raw numbers. (not exposed in .nspawn and on
   the cmdline, since containers likely have different user tables, and
   the existing --user= switch appears to be the better option)

6. OCI hook commands (no exposed in .nspawn/cmdline, as very specific to
   OCI)

7. Creation of additional devices nodes in /dev. Most likely not a good
   idea, hence not exposed in .nspawn/cmdline. There's already --bind=
   to achieve the same, which is the better alternative.

8. Explicit syscall filters. This is not a good idea, due to the skewed
   arch support, hence not exposed through .nspawn/cmdline.

9. Configuration of some sysctls on a whitelist. Questionnable, not
   supported in .nspawn/cmdline for now.

10. Configuration of all 5 types of capabilities. Not a useful concept,
    since the kernel will reduce the caps on execve() anyway. Not
    exposed through .nspawn/cmdline as this is not very useful hence.

Note that this only implements the OCI runtime logic itself. It does not
provide a runc-compatible command line tool. This is left for a later
PR. Only with that in place tools such as "buildah" can use the OCI
support in nspawn as drop-in replacement.

Currently still missing is OCI hook support, but it's already parsed and
everything, and should be easy to add. Other than that it's OCI is
implemented pretty comprehensively.

There's a list of incompatibilities in the nspawn-oci.c file. In a later
PR I'd like to convert this into proper markdown and add it to the
documentation directory.
2019-03-15 15:41:28 +01:00
Lennart Poettering 760877e90c util: split out sorting related calls to new sort-util.[ch] 2019-03-13 12:16:43 +01:00
Zbigniew Jędrzejewski-Szmek 0e636bf51a nspawn: fix memleak uncovered by fuzzer
Also use TAKE_PTR as appropriate.
2019-03-11 14:29:30 +01:00
Lennart Poettering 6c610acaaa nspawn: add --volatile=overlay support
Fixes: #11054 #3847
2019-03-01 14:11:06 +01:00
Lennart Poettering c55d0ae764 nspawn: fix an error path 2019-03-01 14:11:06 +01:00
Lennart Poettering e5b43a04b6 nspawn: add volatile mode multiplexer call setup_volatile_mode()
Just some refactoring, no change in behaviour.
2019-03-01 14:11:06 +01:00
Lennart Poettering 0646d3c3dd nspawn: explicitly refuse mounts over /
Previously this would fail later on, but let's filter this out at the
time of parsing.
2019-03-01 14:11:06 +01:00
Lennart Poettering e4de72876e util-lib: split out all temporary file related calls into tmpfiles-util.c
This splits out a bunch of functions from fileio.c that have to do with
temporary files. Simply to make the header files a bit shorter, and to
group things more nicely.

No code changes, just some rearranging of source files.
2018-12-02 13:22:29 +01:00
Zbigniew Jędrzejewski-Szmek b2ac2b01c8
Merge pull request #10996 from poettering/oci-prep
Preparation for the nspawn-OCI work
2018-11-30 10:09:00 +01:00
Zbigniew Jędrzejewski-Szmek 049af8ad0c Split out part of mount-util.c into mountpoint-util.c
The idea is that anything which is related to actually manipulating mounts is
in mount-util.c, but functions for mountpoint introspection are moved to the
new file. Anything which requires libmount must be in mount-util.c.

This was supposed to be a preparation for further changes, with no functional
difference, but it results in a significant change in linkage:

$ ldd build/libnss_*.so.2
(before)
build/libnss_myhostname.so.2:
	linux-vdso.so.1 (0x00007fff77bf5000)
	librt.so.1 => /lib64/librt.so.1 (0x00007f4bbb7b2000)
	libmount.so.1 => /lib64/libmount.so.1 (0x00007f4bbb755000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f4bbb734000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f4bbb56e000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f4bbb8c1000)
	libblkid.so.1 => /lib64/libblkid.so.1 (0x00007f4bbb51b000)
	libuuid.so.1 => /lib64/libuuid.so.1 (0x00007f4bbb512000)
	libselinux.so.1 => /lib64/libselinux.so.1 (0x00007f4bbb4e3000)
	libpcre2-8.so.0 => /lib64/libpcre2-8.so.0 (0x00007f4bbb45e000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007f4bbb458000)
build/libnss_mymachines.so.2:
	linux-vdso.so.1 (0x00007ffc19cc0000)
	librt.so.1 => /lib64/librt.so.1 (0x00007fdecb74b000)
	libcap.so.2 => /lib64/libcap.so.2 (0x00007fdecb744000)
	libmount.so.1 => /lib64/libmount.so.1 (0x00007fdecb6e7000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fdecb6c6000)
	libc.so.6 => /lib64/libc.so.6 (0x00007fdecb500000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fdecb8a9000)
	libblkid.so.1 => /lib64/libblkid.so.1 (0x00007fdecb4ad000)
	libuuid.so.1 => /lib64/libuuid.so.1 (0x00007fdecb4a2000)
	libselinux.so.1 => /lib64/libselinux.so.1 (0x00007fdecb475000)
	libpcre2-8.so.0 => /lib64/libpcre2-8.so.0 (0x00007fdecb3f0000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007fdecb3ea000)
build/libnss_resolve.so.2:
	linux-vdso.so.1 (0x00007ffe8ef8e000)
	librt.so.1 => /lib64/librt.so.1 (0x00007fcf314bd000)
	libcap.so.2 => /lib64/libcap.so.2 (0x00007fcf314b6000)
	libmount.so.1 => /lib64/libmount.so.1 (0x00007fcf31459000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fcf31438000)
	libc.so.6 => /lib64/libc.so.6 (0x00007fcf31272000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fcf31615000)
	libblkid.so.1 => /lib64/libblkid.so.1 (0x00007fcf3121f000)
	libuuid.so.1 => /lib64/libuuid.so.1 (0x00007fcf31214000)
	libselinux.so.1 => /lib64/libselinux.so.1 (0x00007fcf311e7000)
	libpcre2-8.so.0 => /lib64/libpcre2-8.so.0 (0x00007fcf31162000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007fcf3115c000)
build/libnss_systemd.so.2:
	linux-vdso.so.1 (0x00007ffda6d17000)
	librt.so.1 => /lib64/librt.so.1 (0x00007f610b83c000)
	libcap.so.2 => /lib64/libcap.so.2 (0x00007f610b835000)
	libmount.so.1 => /lib64/libmount.so.1 (0x00007f610b7d8000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f610b7b7000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f610b5f1000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f610b995000)
	libblkid.so.1 => /lib64/libblkid.so.1 (0x00007f610b59e000)
	libuuid.so.1 => /lib64/libuuid.so.1 (0x00007f610b593000)
	libselinux.so.1 => /lib64/libselinux.so.1 (0x00007f610b566000)
	libpcre2-8.so.0 => /lib64/libpcre2-8.so.0 (0x00007f610b4e1000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f610b4db000)

(after)
build/libnss_myhostname.so.2:
	linux-vdso.so.1 (0x00007fff0b5e2000)
	librt.so.1 => /lib64/librt.so.1 (0x00007fde0c328000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fde0c307000)
	libc.so.6 => /lib64/libc.so.6 (0x00007fde0c141000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fde0c435000)
build/libnss_mymachines.so.2:
	linux-vdso.so.1 (0x00007ffdc30a7000)
	librt.so.1 => /lib64/librt.so.1 (0x00007f06ecabb000)
	libcap.so.2 => /lib64/libcap.so.2 (0x00007f06ecab4000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f06eca93000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f06ec8cd000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f06ecc15000)
build/libnss_resolve.so.2:
	linux-vdso.so.1 (0x00007ffe95747000)
	librt.so.1 => /lib64/librt.so.1 (0x00007fa56a80f000)
	libcap.so.2 => /lib64/libcap.so.2 (0x00007fa56a808000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fa56a7e7000)
	libc.so.6 => /lib64/libc.so.6 (0x00007fa56a621000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fa56a964000)
build/libnss_systemd.so.2:
	linux-vdso.so.1 (0x00007ffe67b51000)
	librt.so.1 => /lib64/librt.so.1 (0x00007ffb32113000)
	libcap.so.2 => /lib64/libcap.so.2 (0x00007ffb3210c000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007ffb320eb000)
	libc.so.6 => /lib64/libc.so.6 (0x00007ffb31f25000)
	/lib64/ld-linux-x86-64.so.2 (0x00007ffb3226a000)

I don't quite understand what is going on here, but let's not be too picky.
2018-11-29 21:03:44 +01:00
Lennart Poettering 17c58ba97b nspawn: let's also pre-mount /dev/mqueue 2018-11-29 20:21:40 +01:00
Zbigniew Jędrzejewski-Szmek baaa35ad70 coccinelle: make use of SYNTHETIC_ERRNO
Ideally, coccinelle would strip unnecessary braces too. But I do not see any
option in coccinelle for this, so instead, I edited the patch text using
search&replace to remove the braces. Unfortunately this is not fully automatic,
in particular it didn't deal well with if-else-if-else blocks and ifdefs, so
there is an increased likelikehood be some bugs in such spots.

I also removed part of the patch that coccinelle generated for udev, where we
returns -1 for failure. This should be fixed independently.
2018-11-22 10:54:38 +01:00
Lennart Poettering 1099ceebce nspawn: optionally don't mount a tmpfs over /tmp (#10294)
nspawn: optionally, don't mount a tmpfs on /tmp

Fixes: #10260
2018-10-08 18:32:03 +02:00
Yu Watanabe 93bab28895 tree-wide: use typesafe_qsort() 2018-09-19 08:02:52 +09:00
Franck Bui 03d0f4b58e nspawn: always use mode 555 for /sys
When a network namespace is needed, /sys is mounted as tmpfs (see commit
d8fc6a000f for details).

But in this case mode 755 was used as initial permissions for /sys whereas the
default mode for sysfs is 555.

In practice using 755 doesn't have any impact because /sys is mounted read-only
too but for consistency, let's use the correct mode.

Fixes: #10050
2018-09-11 00:34:00 +02:00