When in a userns environment we cannot take away per-mount point flags
set on a mount point that was passed to us. Hence we need to be careful
to always check the actual mount flags in place and manipulate only
those flags of them that we actually want to change and not reset more
as side-effect.
We mostly got this right already in
bind_remount_recursive_with_mountinfo(), but didn't in the simpler
bind_remount_one_with_mountinfo(). Catch up.
(The old code assumed that the MountEntry.flags field contained the
right flag settings, but it actually doesn't for new mounts we just
established as for those mount() establishes the initial flags for us,
and we have to read them back to figure out which ones the kernel
picked.)
Fixes: #13622
This cleans up the function in multiple ways:
- change order of parameters to follow our usualy system of putting
return parameters last
- rename return parameter "ret" as we usually do
- don't initialize local variables we override immediately anyway
- downgrade log messages to LOG_DEBUG (since we don't log about any
other errors here above LOG_DEBUG, as this is mostly an "API"-style
function)
- handle that mnt_fs_get_vfs_options() may return NULL (according to
docs)
- manually map the ST_xyz to MS_xyz flags on statvfs(), because while
they are mostly the same, they aren't entirely the same, MS_RELATIME and
ST_RELATIME are defined differently (sad!)
To support ProtectHome=y in a user namespace (which mounts the inaccessible
nodes), the nodes need to be accessible by the user. Create these paths and
devices in the user runtime directory so they can be used later if needed.
The commit
"util: Do not clear parent mount flags when setting up namespaces"
introduced a statvfs call read the flags of the original mount
and have them applied to the bind mount.
This has two problems:
(1) The mount flags returned by statvfs(2) do not match the flags
accepted by mount(2). For example, the value 4096 means ST_RELATIME
when returned by statvfs(2), but means MS_BIND when passed to mount(2).
(2) A call to statvfs blocks indefinitely when ran against a disconnected
network drive ( https://github.com/systemd/systemd/issues/12667 ).
We already use libmount to parse `/proc/self/mountinfo` but did not use the
mount flag information from there. This patch changes that to use the mount
flags parsed by libmount instead of calling statvfs. Only if getting the
flags through libmount fails we call statvfs.
Fixes https://github.com/systemd/systemd/issues/12667
Whenever I see EXTRACT_QUOTES, I'm always confused whether it means to
leave the quotes in or to take them out. Let's say "unquote", like we
say "cunescape".
This wraps a few common steps. It is defined as inline function instead of in a
.c file to avoid having a .c file. With a .c file, we would have three choices:
- either link it into libshared, but then then libshared would have to be
linked to libmount.
- or compile the .c file into each target separately. This has the disdvantage
that configuration of every target has to be updated and stuff will be compiled
multiple times anyway, which is not too different from keeping this in the
header file.
- or create a new convenience library just for this. This also has the disadvantage
that the every target would have to be updated, and a separate library for a
10 line function seems overkill.
By keeping everything in a header file, we compile this a few times, but
otherwise it's the least painful option. The compiler can optimize most of the
function away, because it knows if 'source' is set or not.
It seems better to use just a single parsing algorithm for /proc/self/mountinfo.
Also, unify the naming of variables in all places that use mnt_table_next_fs().
It makes it easier to compare the different call sites.
This is partially a refactoring, but also makes many more places use
unlocked operations implicitly, i.e. all users of fopen_temporary().
AFAICT, the uses are always for short-lived files which are not shared
externally, and are just used within the same context. Locking is not
necessary.
The function is otherwise generic enough to toggle other bind mount
flags beyond MS_RDONLY (for example: MS_NOSUID or MS_NODEV), hence let's
beef it up slightly to support that too.