If we don't succeed on the first try it's because another process is
opening the same device. Do a microsleep for 2ms to increase the
chances it has completed the next time around the loop.
The symlink /dev/mapper/dm_name is created by udev after a mapper
device is set up. So libdevmapper/libcrypsetup might tell us that
a verity device exists, but the symlink we use as the source for
the mount operation might not be there yet.
Instead of falling back to a new unique device set up, wait for
the udev event matching on the expected devlink for at least 100ms
(after which the benefits of sharing a device in terms of setup
time start to disappear - on my production machines, opening a new
verity device seems to take between 150ms and 300ms)
This effectively makes little difference because we exit soon later
anyway, which will close the fds, too. However, it's still useful since
it means the parent will get EOF events on them in the order we process
things and isn't delayed to process the data from the pipes until the
child dies.
LOOP_CONFIGURE allows us to configure a loopback device in one ioctl
instead of two, which is not just faster but also removes the race that
udev might start probing the device before we adjusted things properly.
Unfortunately LOOP_CONFIGURE is broken in regards to LO_FLAGS_PARTSCAN
as of kernel 5.8.0. This patch contains a work-around for that, to
fallback to old behaviour if partition scanning is requested but does
not work. Sucks a bit.
Proposed upstream fix for that issue:
https://lkml.org/lkml/2020/8/6/97
Let's fix up invalid GECOS fields both when we convert from NSS to JSON
and the other way round.
Kinda sucks we have to do that, but NSS does it when writing data to
/etc/passwd, so let's do the same.
Fixes: #16668
User records have the realname/gecos fields, groups never had that, but
it would really be useful to have it, hence let's add it with similar
semantics.
We enforce the same syntax as for GECOS, since it's better to start with
strict rules and losen them later instead of the opposite.
Follows the same pattern and features as RootImage, but allows an
arbitrary mount point under / to be specified by the user, and
multiple values - like BindPaths.
Original implementation by @topimiettinen at:
https://github.com/systemd/systemd/pull/14451
Reworked to use dissect's logic instead of bare libmount() calls
and other review comments.
Thanks Topi for the initial work to come up with and implement
this useful feature.
Apparently, if IO is still in flight at the moment we invoke LOOP_CLR_FD
it is likely simply dropped (probably because yanking physical storage,
such as a USB stick would drop it too). Let's protect ourselves against
that and always sync explicitly before we invoke it.
The explicit limit is dropped, which means that we return to the kernel default
of 50% of RAM. See 362a55fc14 for a discussion why that is not as much as it
seems. It turns out various applications need more space in /dev/shm and we
would break them by imposing a low limit.
While at it, rename the define and use a single macro for various tmpfs mounts.
We don't really care what the purpose of the given tmpfs is, so it seems
reasonable to use a single macro.
This effectively reverts part of 7d85383edb. Fixes#16617.
Allows to specify mount options for RootImage.
In case of multi-partition images, the partition number can be prefixed
followed by colon. Eg:
RootImageOptions=1:ro,dev 2:nosuid nodev
In absence of a partition number, 0 is assumed.
This should be enough to fix https://bugzilla.redhat.com/show_bug.cgi?id=1856514.
But the limit should be significantly higher than 10% anyway. By setting a
limit on /tmp at 10% we'll break many reasonable use cases, even though the
machine would deal fine with a much larger fraction devoted to /tmp.
(In the first version of this patch I made it 25% with the comment that
"Even 25% might be too low.". The kernel default is 50%, and we have been using
that seemingly without trouble since https://fedoraproject.org/wiki/Features/tmp-on-tmpfs.
So let's just make it 50% again.)
See 7d85383edb.
(Another consideration is that we learned from from the whole initiative with
zram in Fedora that a reasonable size for zram is 0.5-1.5 of RAM, and that pretty
much all systems benefit from having zram or zswap enabled. Thus it is reasonable
to assume that it'll become widely used. Taking the usual compression effectiveness
of 0.2 into account, machines have effective memory available of between
1.0 - 0.2*0.5 + 0.5 = 1.4 (for zram sized to 0.5 of RAM) and
1.0 - 0.2*1.5 + 1.5 = 2.2 (for zram 1.5 sized to 1.5 of RAM) times RAM size.
This means that the 10% was really like 7-4% of effective memory.)
A comment is added to mount-util.h to clarify that tmp.mount is separate.
Opening a verity device is an expensive operation. The kernelspace operations
are mostly sequential with a global lock held regardless of which device
is being opened. In userspace jumps in and out of multiple libraries are
required. When signatures are used, there's the additional cryptographic
checks.
We know when two devices are identical: they have the same root hash.
If libcrypsetup returns EEXIST, double check that the hashes are really
the same, and that either both or none have a signature, and if everything
matches simply remount the already open device. The kernel will do
reference counting for us.
In order to quickly and reliably discover if a device is already open,
change the node naming scheme from '/dev/mapper/major:minor-verity' to
'/dev/mapper/$roothash-verity'.
Unfortunately libdevmapper is not 100% reliable, so in some case it
will say that the device already exists and it is active, but in
reality it is not usable. Fallback to an individually-activated
unique device name in those cases for robustness.
This changes the code to allow looking at multiple files with different
prefixes, but uses "/etc" and "/usr/lib". rpm-ostree uses
/usr/lib/{passwd,group} with nss-altfiles. I see no harm in simply trying both
paths on all systems.
Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1857530.
A minor memory leak is fixed: hashmap_put() returns -EEXIST is the key is
present *and* and the value is different. It return 0 if the value is the
same. Thus, we would leak the user/group name if it was specified multiple
times with the same uid/gid. I opted to remove the warning message completely:
with multiple files it is reasonable to have the same name defined more than
once. But even with one file the warning is dubious: all tools that read those
files deal correctly with duplicate entries and we are not writing a linter.
The test binary has two modes: in the default argument-less mode, it
just checks that "root" can be resolved. When invoked manually, a root
prefix and user/group names can be specified.