Similarly to "setup" vs. "set up", "fallback" is a noun, and "fall back"
is the verb. (This is pretty clear when we construct a sentence in the
present continous: "we are falling back" not "we are fallbacking").
Previously, we'd create them from user-runtime-dir@.service. That has
one benefit: since this service runs privileged, we can create the full
set of device nodes. It has one major drawback though: it security-wise
problematic to create files/directories in directories as privileged
user in directories owned by unprivileged users, since they can use
symlinks to redirect what we want to do. As a general rule we hence
avoid this logic: only unpriv code should populate unpriv directories.
Hence, let's move this code to an appropriate place in the service
manager. This means we lose the inaccessible block device node, but
since there's already a fallback in place, this shouldn't be too bad.
Let's make /run/host the sole place we pass stuff from host to container
in and place the "inaccessible" nodes in /run/host too.
In contrast to the previous two commits this is a minor compat break, but
not a relevant one I think. Previously the container manager would place
these nodes in /run/systemd/inaccessible/ and that's where PID 1 in the
container would try to add them too when missing. Container manager and
PID 1 in the container would thus manage the same dir together.
With this change the container manager now passes an immutable directory
to the container and leaves /run/systemd entirely untouched, and managed
exclusively by PID 1 inside the container, which is nice to have clear
separation on who manages what.
In order to make sure systemd then usses the /run/host/inaccesible/
nodes this commit changes PID 1 to look for that dir and if it exists
will symlink it to /run/systemd/inaccessible.
Now, this will work fine if new nspawn and new pid 1 in the container
work together. as then the symlink is created and the difference between
the two dirs won't matter.
For the case where an old nspawn invokes a new PID 1: in this case
things work as they always worked: the dir is managed together.
For the case where different container manager invokes a new PID 1: in
this case the nodes aren't typically passed in, and PID 1 in the
container will try to create them and will likely fail partially (though
gracefully) when trying to create char/block device nodes. THis is fine
though as there are fallbacks in place for that case.
For the case where a new nspawn invokes an old PID1: this is were the
(minor) incompatibily happens: in this case new nspawn will place the
nodes in the /run/host/inaccessible/ subdir, but the PID 1 in the
container won't look for them there. Since the nodes are also not
pre-created in /run/systed/inaccessible/ PID 1 will try to create them
there as if a different container manager sets them up. This is of
course not sexy, but is not a total loss, since as mentioned fallbacks
are in place anyway. Hence I think it's OK to accept this minor
incompatibility.
To support ProtectHome=y in a user namespace (which mounts the inaccessible
nodes), the nodes need to be accessible by the user. Create these paths and
devices in the user runtime directory so they can be used later if needed.
Ideally, coccinelle would strip unnecessary braces too. But I do not see any
option in coccinelle for this, so instead, I edited the patch text using
search&replace to remove the braces. Unfortunately this is not fully automatic,
in particular it didn't deal well with if-else-if-else blocks and ifdefs, so
there is an increased likelikehood be some bugs in such spots.
I also removed part of the patch that coccinelle generated for udev, where we
returns -1 for failure. This should be fixed independently.
This way, we can extend the macro a bit with stuff pulled in from other
headers without this affecting everything which pulls in macro.h, which
is one of our most basic headers.
This is just refactoring, no change in behaviour, in prepartion for
later changes.
I think this is a slightly cleaner approach than parsing the
configuration file at multiple places, as this way there's only a single
reload cycle for logind.conf, and that's systemd-logind.service's
runtime.
This means that logind and dbus become a requirement of
user-runtime-dir, but given that XDG_RUNTIME_DIR is not set anyway
without logind and dbus around this isn't really any limitation.
This also simplifies linking a bit as this means user-runtime-dir
doesn't have to link against any code of logind itself.
Fix#9993. When this code was split out to user-runtime-dir, it forgot to
include the call to mac_selinux_init(). So mkdir_label() stopped working.
Fixes: a9f0f5e501 ("logind: split %t directory creation to a helper
unit")
Let's fold get_user_creds_clean() into get_user_creds(), and introduce a
flags argument for it to select "clean" behaviour. This flags parameter
also learns to other new flags:
- USER_CREDS_SYNTHESIZE_FALLBACK: in this mode the user records for
root/nobody are only synthesized as fallback. Normally, the synthesized
records take precedence over what is in the user database. With this
flag set this is reversed, and the user database takes precedence, and
the synthesized records are only used if they are missing there. This
flag should be set in cases where doing NSS is deemed safe, and where
there's interest in knowing the correct shell, for example if the
admin changed root's shell to zsh or suchlike.
- USER_CREDS_ALLOW_MISSING: if set, and a UID/GID is specified by
numeric value, and there's no user/group record for it accept it
anyway. This allows us to fix#9767
This then also ports all users to set the most appropriate flags.
Fixes: #9767
[zj: remove one isempty() call]
As the comments already say it might be quite likely that
$XDG_RUNTIME_DIR is not set up as mount, and we shouldn't complain about
that.
Moreover, let's make this idempotent, so that a runtime dir that is
already gone and is removed again doesn't cause failure.
When unmounting user runtime directory, only UID is necessary,
and the corresponding user may not exist anymore.
This makes first try to parse the input by parse_uid(), and only if it
fails, prase the input by get_user_creds().
Fixes#9541.
Unfortunately this needs a new binary to do the mount because there's just
too many special steps to outsource this to systemd-mount:
- EPERM needs to be treated specially
- UserRuntimeDir= setting must be obeyed
- SELinux label must be adjusted
This allows user@.service to be started independently of logind.
So 'systemctl start user@nnn' will start the user manager for user nnn.
Logind will start it too when the user logs in, and will stop it (unless
lingering is enabled) when the user logs out.
Fixes#7339.