Let's make umount_verbose() more like mount_verbose_xyz(), i.e. take log
level and flags param. In particular the latter matters, since we
typically don't actually want to follow symlinks when unmounting.
On new kernels (>= 5.8) unprivileged users may create the 0:0 character
device node. Which is great, as we can use that as inaccessible device
nodes if we run unprivileged. Hence, change how we find the right
inaccessible device inodes: when the user asks for a block device node,
but we have none, try the char device node first. If that doesn't exist,
fall back to the socket node as before.
This means that:
1. in the best case we'll return a node if the right device node type
2. otherwise we hopefully at least can return a device node if one asked
for even if the type doesn't match (i.e. we return char instead of
the requested block device node)
3. in the worst case (old kernels…) we'll return a socket node
https://tools.ietf.org/html/draft-knodel-terminology-02https://lwn.net/Articles/823224/
This gets rid of most but not occasions of these loaded terms:
1. scsi_id and friends are something that is supposed to be removed from
our tree (see #7594)
2. The test suite defines an API used by the ubuntu CI. We can remove
this too later, but this needs to be done in sync with the ubuntu CI.
3. In some cases the terms are part of APIs we call or where we expose
concepts the kernel names the way it names them. (In particular all
remaining uses of the word "slave" in our codebase are like this,
it's used by the POSIX PTY layer, by the network subsystem, the mount
API and the block device subsystem). Getting rid of the term in these
contexts would mean doing some major fixes of the kernel ABI first.
Regarding the replacements: when whitelist/blacklist is used as noun we
replace with with allow list/deny list, and when used as verb with
allow-list/deny-list.
Let's make sure $XDG_RUNTIME_DIR for the user instance and /run for the
system instance is always organized the same way: the "inaccessible"
device nodes should be placed in a subdir of either called "systemd" and
a subdir of that called "inaccessible".
This way we can emphasize the common behaviour, and only differ where
really necessary.
Follow-up for #13823
If we're using a set with _put_strdup(), most of the time we want to use
string hash ops on the set, and free the strings when done. This defines
the appropriate a new string_hash_ops_free structure to automatically free
the keys when removing the set, and makes set_put_strdup() and set_put_strdupv()
instantiate the set with those hash ops.
hashmap_put_strdup() was already doing something similar.
(It is OK to instantiate the set earlier, possibly with a different hash ops
structure. set_put_strdup() will then use the existing set. It is also OK
to call set_free_free() instead of set_free() on a set with
string_hash_ops_free, the effect is the same, we're just overriding the
override of the cleanup function.)
No functional change intended.
When in a userns environment we cannot take away per-mount point flags
set on a mount point that was passed to us. Hence we need to be careful
to always check the actual mount flags in place and manipulate only
those flags of them that we actually want to change and not reset more
as side-effect.
We mostly got this right already in
bind_remount_recursive_with_mountinfo(), but didn't in the simpler
bind_remount_one_with_mountinfo(). Catch up.
(The old code assumed that the MountEntry.flags field contained the
right flag settings, but it actually doesn't for new mounts we just
established as for those mount() establishes the initial flags for us,
and we have to read them back to figure out which ones the kernel
picked.)
Fixes: #13622
This cleans up the function in multiple ways:
- change order of parameters to follow our usualy system of putting
return parameters last
- rename return parameter "ret" as we usually do
- don't initialize local variables we override immediately anyway
- downgrade log messages to LOG_DEBUG (since we don't log about any
other errors here above LOG_DEBUG, as this is mostly an "API"-style
function)
- handle that mnt_fs_get_vfs_options() may return NULL (according to
docs)
- manually map the ST_xyz to MS_xyz flags on statvfs(), because while
they are mostly the same, they aren't entirely the same, MS_RELATIME and
ST_RELATIME are defined differently (sad!)
To support ProtectHome=y in a user namespace (which mounts the inaccessible
nodes), the nodes need to be accessible by the user. Create these paths and
devices in the user runtime directory so they can be used later if needed.
The commit
"util: Do not clear parent mount flags when setting up namespaces"
introduced a statvfs call read the flags of the original mount
and have them applied to the bind mount.
This has two problems:
(1) The mount flags returned by statvfs(2) do not match the flags
accepted by mount(2). For example, the value 4096 means ST_RELATIME
when returned by statvfs(2), but means MS_BIND when passed to mount(2).
(2) A call to statvfs blocks indefinitely when ran against a disconnected
network drive ( https://github.com/systemd/systemd/issues/12667 ).
We already use libmount to parse `/proc/self/mountinfo` but did not use the
mount flag information from there. This patch changes that to use the mount
flags parsed by libmount instead of calling statvfs. Only if getting the
flags through libmount fails we call statvfs.
Fixes https://github.com/systemd/systemd/issues/12667
Whenever I see EXTRACT_QUOTES, I'm always confused whether it means to
leave the quotes in or to take them out. Let's say "unquote", like we
say "cunescape".
This wraps a few common steps. It is defined as inline function instead of in a
.c file to avoid having a .c file. With a .c file, we would have three choices:
- either link it into libshared, but then then libshared would have to be
linked to libmount.
- or compile the .c file into each target separately. This has the disdvantage
that configuration of every target has to be updated and stuff will be compiled
multiple times anyway, which is not too different from keeping this in the
header file.
- or create a new convenience library just for this. This also has the disadvantage
that the every target would have to be updated, and a separate library for a
10 line function seems overkill.
By keeping everything in a header file, we compile this a few times, but
otherwise it's the least painful option. The compiler can optimize most of the
function away, because it knows if 'source' is set or not.
It seems better to use just a single parsing algorithm for /proc/self/mountinfo.
Also, unify the naming of variables in all places that use mnt_table_next_fs().
It makes it easier to compare the different call sites.
This is partially a refactoring, but also makes many more places use
unlocked operations implicitly, i.e. all users of fopen_temporary().
AFAICT, the uses are always for short-lived files which are not shared
externally, and are just used within the same context. Locking is not
necessary.
The function is otherwise generic enough to toggle other bind mount
flags beyond MS_RDONLY (for example: MS_NOSUID or MS_NODEV), hence let's
beef it up slightly to support that too.