Commit Graph

260 Commits

Author SHA1 Message Date
Zbigniew Jędrzejewski-Szmek 2aed63f427 tree-wide: fix spelling of "fallback"
Similarly to "setup" vs. "set up", "fallback" is a noun, and "fall back"
is the verb. (This is pretty clear when we construct a sentence in the
present continous: "we are falling back" not "we are fallbacking").
2020-08-20 17:45:32 +02:00
Lennart Poettering 3f181262f4 namespace: fix minor memory leak 2020-08-14 15:33:04 +02:00
Luca Boccassi b3d133148e core: new feature MountImages
Follows the same pattern and features as RootImage, but allows an
arbitrary mount point under / to be specified by the user, and
multiple values - like BindPaths.

Original implementation by @topimiettinen at:
https://github.com/systemd/systemd/pull/14451
Reworked to use dissect's logic instead of bare libmount() calls
and other review comments.
Thanks Topi for the initial work to come up with and implement
this useful feature.
2020-08-05 21:34:55 +01:00
Zbigniew Jędrzejewski-Szmek 7e62257219
Merge pull request #16308 from bluca/root_image_options
service: add new RootImageOptions feature
2020-08-03 10:04:36 +02:00
Zbigniew Jędrzejewski-Szmek b67ec8e5b2 pid1: stop limiting size of /dev/shm
The explicit limit is dropped, which means that we return to the kernel default
of 50% of RAM. See 362a55fc14 for a discussion why that is not as much as it
seems. It turns out various applications need more space in /dev/shm and we
would break them by imposing a low limit.

While at it, rename the define and use a single macro for various tmpfs mounts.
We don't really care what the purpose of the given tmpfs is, so it seems
reasonable to use a single macro.

This effectively reverts part of 7d85383edb. Fixes #16617.
2020-07-30 18:48:35 +02:00
Luca Boccassi 18d7370587 service: add new RootImageOptions feature
Allows to specify mount options for RootImage.
In case of multi-partition images, the partition number can be prefixed
followed by colon. Eg:

RootImageOptions=1:ro,dev 2:nosuid nodev

In absence of a partition number, 0 is assumed.
2020-07-29 17:17:32 +01:00
Zbigniew Jędrzejewski-Szmek 6cdc429454
Merge pull request #16340 from keszybz/var-tmp-readonly
Create ro private /var/tmp dir when /var/tmp is read-only
2020-07-14 19:59:48 +02:00
Zbigniew Jędrzejewski-Szmek 56a13a495c pid1: create ro private tmp dirs when /tmp or /var/tmp is read-only
Read-only /var/tmp is more likely, because it's backed by a real device. /tmp
is (by default) backed by tmpfs, but it doesn't have to be. In both cases the
same consideration applies.

If we boot with read-only /var/tmp, any unit with PrivateTmp=yes would fail
because we cannot create the subdir under /var/tmp to mount the private directory.
But many services actually don't require /var/tmp (either because they only use
it occasionally, or because they only use /tmp, or even because they don't use the
temporary directories at all, and PrivateTmp=yes is used to isolate them from
the rest of the system).

To handle both cases let's create a read-only directory under /run/systemd and
mount it as the private /tmp or /var/tmp. (Read-only to not fool the service into
dumping too much data in /run.)

$ sudo systemd-run -t -p PrivateTmp=yes bash
Running as unit: run-u14.service
Press ^] three times within 1s to disconnect TTY.
[root@workstation /]# ls -l /tmp/
total 0
[root@workstation /]# ls -l /var/tmp/
total 0
[root@workstation /]# touch /tmp/f
[root@workstation /]# touch /var/tmp/f
touch: cannot touch '/var/tmp/f': Read-only file system

This commit has more changes than I like to put in one commit, but it's touching all
the same paths so it's hard to split.
exec_runtime_make() was using the wrong cleanup function, so the directory would be
left behind on error.
2020-07-14 19:47:15 +02:00
Christian Göttsche f2df56bfea namespace: unify logging in mount_tmpfs
Fixes: abad72be4d
Follow up: #16426
2020-07-11 21:25:39 +02:00
Christian Göttsche abad72be4d namespace: fix MAC labels of TemporaryFileSystem=
Reproducible with:
  systemd-run -p TemporaryFileSystem=/root -t /bin/bash
    ls -dZ /root

Prior:
  root:object_r:tmpfs_t:s0 /root
Past:
  root:object_r:user_home_dir_t:s0 /root
2020-07-11 00:09:05 +02:00
Zbigniew Jędrzejewski-Szmek cbc056c819 core: wrap some long lines and other formatting changes 2020-07-08 16:37:23 +02:00
Alan Perry 5dc60faae5 add error message when bind mount src missing 2020-07-07 20:04:19 +02:00
Zbigniew Jędrzejewski-Szmek 37b22b3b47 tree: wide "the the" and other trivial grammar fixes 2020-07-02 09:51:38 +02:00
Luca Boccassi d4d55b0d13 core: add RootHashSignature service parameter
Allow to explicitly pass root hash signature as a unit option. Takes precedence
over implicit checks.
2020-06-25 08:45:21 +01:00
Luca Boccassi c2923fdcd7 dissect/nspawn: add support for dm-verity root hash signature
Since cryptsetup 2.3.0 a new API to verify dm-verity volumes by a
pkcs7 signature, with the public key in the kernel keyring,
is available. Use it if libcryptsetup supports it.
2020-06-25 08:45:21 +01:00
Lennart Poettering 6b000af4f2 tree-wide: avoid some loaded terms
https://tools.ietf.org/html/draft-knodel-terminology-02
https://lwn.net/Articles/823224/

This gets rid of most but not occasions of these loaded terms:

1. scsi_id and friends are something that is supposed to be removed from
   our tree (see #7594)

2. The test suite defines an API used by the ubuntu CI. We can remove
   this too later, but this needs to be done in sync with the ubuntu CI.

3. In some cases the terms are part of APIs we call or where we expose
   concepts the kernel names the way it names them. (In particular all
   remaining uses of the word "slave" in our codebase are like this,
   it's used by the POSIX PTY layer, by the network subsystem, the mount
   API and the block device subsystem). Getting rid of the term in these
   contexts would mean doing some major fixes of the kernel ABI first.

Regarding the replacements: when whitelist/blacklist is used as noun we
replace with with allow list/deny list, and when used as verb with
allow-list/deny-list.
2020-06-25 09:00:19 +02:00
Luca Boccassi 0389f4fa81 core: add RootHash and RootVerity service parameters
Allow to explicitly pass root hash (explicitly or as a file) and verity
device/file as unit options. Take precedence over implicit checks.
2020-06-23 10:50:09 +02:00
Zbigniew Jędrzejewski-Szmek 9664be199a
Merge pull request #16118 from poettering/inaccessible-fixlets
move $XDG_RUNTIME_DIR/inaccessible/ to $XDG_RUNTIME_DIR/systemd/inaccessible
2020-06-10 10:23:13 +02:00
Lennart Poettering 48b747fa03 inaccessible: move inaccessible file nodes to /systemd/ subdir in runtime dir always
Let's make sure $XDG_RUNTIME_DIR for the user instance and /run for the
system instance is always organized the same way: the "inaccessible"
device nodes should be placed in a subdir of either called "systemd" and
a subdir of that called "inaccessible".

This way we can emphasize the common behaviour, and only differ where
really necessary.

Follow-up for #13823
2020-06-09 16:23:56 +02:00
Luca Boccassi e7cbe5cb9e dissect: support single-filesystem verity images with external verity hash
dm-verity support in dissect-image at the moment is restricted to GPT
volumes.
If the image a single-filesystem type without a partition table (eg: squashfs)
and a roothash/verity file are passed, set the verity flag and mark as
read-only.
2020-06-09 12:19:21 +01:00
Topi Miettinen 7d85383edb tree-wide: add size limits for tmpfs mounts
Limit size of various tmpfs mounts to 10% of RAM, except volatile root and /var
to 25%. Another exception is made for /dev (also /devs for PrivateDevices) and
/sys/fs/cgroup since no (or very few) regular files are expected to be used.

In addition, since directories, symbolic links, device specials and xattrs are
not counted towards the size= limit, number of inodes is also limited
correspondingly: 4MB size translates to 1k of inodes (assuming 4k each), 10% of
RAM (using 16GB of RAM as baseline) translates to 400k and 25% to 1M inodes.

Because nr_inodes option can't use ratios like size option, there's an
unfortunate side effect that with small memory systems the limit may be on the
too large side. Also, on an extremely small device with only 256MB of RAM, 10%
of RAM for /run may not be enough for re-exec of PID1 because 16MB of free
space is required.
2020-05-13 00:37:18 +02:00
Lennart Poettering 0cd41757d0 sd-bus: work around ubsan warning
ubsan complains that we add an offset to a NULL ptr here in some cases.
Which isn't really a bug though, since we only use it as the end
condition for a for loop, but we can still fix it...

Fixes: #15522
2020-04-23 08:54:30 +02:00
Topi Miettinen c3151977d7 namespace: fix MAC labels of /dev when PrivateDevices=yes
Without changing the SELinux label for private /dev of a service, it will take
a generic file system label:
system_u:object_r:tmpfs_t:s0

After this change it is the same as without `PrivateDevices=yes`:
system_u:object_r:device_t:s0

This helps writing SELinux policies, as the same rules for `/dev` will apply
despite any `PrivateDevices=yes` setting.
2020-03-12 08:23:27 +00:00
Topi Miettinen de46b2be07
namespace: ignore prefix chars when comparing paths
Other callers of path_strv_contains() or PATH_IN_SET() don't seem to handle
paths prefixed with -+.
2020-03-10 16:48:34 +02:00
Zbigniew Jędrzejewski-Szmek 105a1a36cd tree-wide: fix spelling of lookup and setup verbs
"set up" and "look up" are the verbs, "setup" and "lookup" are the nouns.
2020-03-03 15:02:53 +01:00
Topi Miettinen aeac9dd647 Revert "namespace: fix MAC labels of /dev when PrivateDevices=yes"
This reverts commit e6e81ec0a5.
2020-02-29 23:35:43 +09:00
Topi Miettinen e6e81ec0a5 namespace: fix MAC labels of /dev when PrivateDevices=yes
Without changing the SELinux label for private /dev of a service, it will take
a generic file system label:
system_u:object_r:tmpfs_t:s0

After this change it is the same as without `PrivateDevices=yes`:
system_u:object_r:device_t:s0

This helps writing SELinux policies, as the same rules for `/dev` will apply
despite any `PrivateDevices=yes` setting.
2020-02-28 14:17:48 +00:00
Christian Göttsche 1acf344dfa core: do not prepare a SELinux context for dummy files for devicenode bind-mounting
Let systemd create the dummy file where a device node will be mounted on with the default label for the parent directory (e.g. /tmp/namespace-dev-yTMwAe/dev/).

Fixes: #13762
2020-02-06 10:20:14 +01:00
Lennart Poettering 91dd5f7cbe core: add new LogNamespace= execution setting 2020-01-31 15:01:43 +01:00
Lennart Poettering 575a915a74
Merge pull request #14532 from poettering/namespace-dynamic-user-fix
Make DynamicUser=1 work in a userns container
2020-01-13 16:47:15 +01:00
Lennart Poettering 7cce68e1e0 core: make sure we use the correct mount flag when re-mounting bind mounts
When in a userns environment we cannot take away per-mount point flags
set on a mount point that was passed to us. Hence we need to be careful
to always check the actual mount flags in place and manipulate only
those flags of them that we actually want to change and not reset more
as side-effect.

We mostly got this right already in
bind_remount_recursive_with_mountinfo(), but didn't in the simpler
bind_remount_one_with_mountinfo(). Catch up.

(The old code assumed that the MountEntry.flags field contained the
right flag settings, but it actually doesn't for new mounts we just
established as for those mount() establishes the initial flags for us,
and we have to read them back to figure out which ones the kernel
picked.)

Fixes: #13622
2020-01-09 15:18:08 +01:00
Lennart Poettering b0a94268f8 core: when we cannot open an image file for write, try read-only
Closes: #14442
2020-01-09 11:18:06 +01:00
Lennart Poettering c8c535d589 namespace: tweak checks whether we can mount image read-only
So far we set up a loopback file read-only iff ProtectSystem= and
ProtectHome= both where set to values that mark these dirs read-only.
Let's extend that and also be happy if /home and the root dir are marked
read-only by some other means.

Fixes: #14442
2020-01-09 11:18:02 +01:00
Anita Zhang e5f10cafe0 core: create inaccessible nodes for users when making runtime dirs
To support ProtectHome=y in a user namespace (which mounts the inaccessible
nodes), the nodes need to be accessible by the user. Create these paths and
devices in the user runtime directory so they can be used later if needed.
2019-12-18 11:09:30 -08:00
Anita Zhang adae5eb977
Merge pull request #14219 from poettering/homed-preparatory-loop
preparatory /dev/loopN support split out of homed PR
2019-12-04 16:07:41 -08:00
Jérémy Rosen a652f050a7 Create parent directories when creating systemd-private subdirs
This is needed when systemd is compiled without systemd-tmpfiles
2019-12-04 09:22:52 +01:00
Lennart Poettering e08f94acf5 loop-util: accept loopback flags when creating loopback device
This way callers can choose if they want partition scanning or not.
2019-12-02 10:05:09 +01:00
Kevin Kuehler 94a7b2759d core: ProtectKernelLogs= mask kmsg in proc and sys
Block access to /dev/kmsg and /proc/kmsg when ProtectKernelLogs is set.
2019-11-14 12:58:43 -08:00
Yu Watanabe e30e8b5073 tree-wide: drop stat.h or statfs.h when stat-util.h is included 2019-11-04 00:30:32 +09:00
Yu Watanabe 455fa9610c tree-wide: drop string.h when string-util.h or friends are included 2019-11-04 00:30:32 +09:00
Yu Watanabe f5947a5e92 tree-wide: drop missing.h 2019-10-31 17:57:03 +09:00
Zbigniew Jędrzejewski-Szmek a5648b8094 basic/fs-util: change CHASE_OPEN flag into a separate output parameter
chase_symlinks() would return negative on error, and either a non-negative status
or a non-negative fd when CHASE_OPEN was given. This made the interface quite
complicated, because dependning on the flags used, we would get two different
"types" of return object. Coverity was always confused by this, and flagged
every use of chase_symlinks() without CHASE_OPEN as a resource leak (because it
would this that an fd is returned). This patch uses a saparate output parameter,
so there is no confusion.

(I think it is OK to have functions which return either an error or an fd. It's
only returning *either* an fd or a non-fd that is confusing.)
2019-10-24 22:44:24 +09:00
Lennart Poettering 2caa38e99f tree-wide: some more [static] related fixes
let's add [static] where it was missing so far

Drop [static] on parameters that can be NULL.

Add an assert() around parameters that have [static] and can't be NULL
hence.

Add some "const" where it was forgotten.
2019-07-12 16:40:10 +02:00
Lennart Poettering c6134d3e2f path-util: get rid of prefix_root()
prefix_root() is equivalent to path_join() in almost all ways, hence
let's remove it.

There are subtle differences though: prefix_root() will try shorten
multiple "/" before and after the prefix. path_join() doesn't do that.
This means prefix_root() might return a string shorter than both its
inputs combined, while path_join() never does that. I like the
path_join() semantics better, hence I think dropping prefix_root() is
totally OK. In the end the strings generated by both functon should
always be identical in terms of path_equal() if not streq().

This leaves prefix_roota() in place. Ideally we'd have path_joina(), but
I don't think we can reasonably implement that as a macro. or maybe we
can? (if so, sounds like something for a later PR)

Also add in a few missing OOM checks
2019-06-21 08:42:55 +09:00
Zbigniew Jędrzejewski-Szmek 7cc5ef5f18 pid1: improve message when setting up namespace fails
I covered the most obvious paths: those where there's a clear problem
with a path specified by the user.

Prints something like this (at error level):
May 21 20:00:01.040418 systemd[125871]: bad-workdir.service: Failed to set up mount namespacing: /run/systemd/unit-root/etc/tomcat9/Catalina: No such file or directory
May 21 20:00:01.040456 systemd[125871]: bad-workdir.service: Failed at step NAMESPACE spawning /bin/true: No such file or directory

Fixes #10972.
2019-05-22 16:28:02 +02:00
Lennart Poettering 6990fb6bc6 tree-wide: (void)ify a few unlink() and rmdir()
Let's be helpful to static analyzers which care about whether we
knowingly ignore return values. We do in these cases, since they are
usually part of error paths.
2019-03-27 18:09:56 +01:00
Lennart Poettering 9ce4e4b0f6 namespace: when DynamicUser=1 is set, mount StateDirectory= bind mounts "nosuid"
Add even more suid/sgid protection to DynamicUser= envionments: the
state directories we bind mount from the host will now have the nosuid
flag set, to disable the effect of nosuid on them.
2019-03-25 19:57:15 +01:00
Lennart Poettering 64e82c1976 mount-util: beef up bind_remount_recursive() to be able to toggle more than MS_RDONLY
The function is otherwise generic enough to toggle other bind mount
flags beyond MS_RDONLY (for example: MS_NOSUID or MS_NODEV), hence let's
beef it up slightly to support that too.
2019-03-25 19:33:55 +01:00
Lennart Poettering 867189b545 namespace: get rid of {} around single-line if blocks 2019-03-25 19:33:55 +01:00
Lennart Poettering 39e91a2777 namespace: get rid of local variable 2019-03-25 19:33:55 +01:00