Limit size of various tmpfs mounts to 10% of RAM, except volatile root and /var
to 25%. Another exception is made for /dev (also /devs for PrivateDevices) and
/sys/fs/cgroup since no (or very few) regular files are expected to be used.
In addition, since directories, symbolic links, device specials and xattrs are
not counted towards the size= limit, number of inodes is also limited
correspondingly: 4MB size translates to 1k of inodes (assuming 4k each), 10% of
RAM (using 16GB of RAM as baseline) translates to 400k and 25% to 1M inodes.
Because the nr_inodes option can't use ratios like the size option, there's an
unfortunate side effect that on small memory systems the limit may be too
large. Also, on an extremely small device with only 256MB of RAM, 10%
of RAM for /run may not be enough for re-exec of PID1 because 16MB of free
space is required.
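The arithmetic behind these inode limits can be sketched as follows. The helper
names are hypothetical and the 4k-per-inode figure is the assumption stated
above; this is an illustration of the calculation, not the code of the change.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption taken from the text above: each inode accounts for roughly 4k. */
#define ASSUMED_BYTES_PER_INODE UINT64_C(4096)

/* Hypothetical helper: derive an nr_inodes= value from a size= limit. */
static uint64_t inodes_for_size(uint64_t size_bytes) {
        return size_bytes / ASSUMED_BYTES_PER_INODE;
}

/* Hypothetical helper: a percentage of a given amount of RAM, in bytes. */
static uint64_t ram_percentage(uint64_t ram_bytes, unsigned percent) {
        assert(percent <= 100);
        return ram_bytes * percent / 100;
}
```

With a 16GB baseline, inodes_for_size(ram_percentage(16GB, 10)) gives roughly
419k inodes (the "400k" above, rounded) and 25% gives 1048576 inodes (the "1M").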
We might pin a home through authentication and a different one through a
session, all from the same PAM context, like sudo does. Hence also store
the referencing fd keyed by the user name.
Since acquiring user records involves plenty of IPC we try to cache user
records in the PAM context between our various hooks. Previously we'd
just cache whatever we acquired, and use it from then on, forever until
the context is destroyed.
This is problematic however, since some programs (notably sudo) use the
same PAM context for multiple different operations. Specifically, sudo
first authenticates the originating user before creating a session for
the destination user, all with the same PAM context. Thankfully, there
was a safety check for this case in place that re-validated that the
cached user record actually matched our current idea of the user to
operate on, but this just meant the hook would fail entirely.
Let's rework this: let's key the cache by the user name, so that we are
not confused by the changing of the user name during the context's
lifecycle and always, strictly use the cached user record of the user we
operate on.
Essentially this just means we now include the user name in the PAM data
field.
Secondly, this gets rid of the extra PAM data field that indicates
whether a user record is from homed or something else. To simplify
things we instead just cache the user record twice: once for consumption
by pam_systemd_home (which only wants homed records) and once shared by
pam_systemd and pam_systemd_home (and whoever else wants it). The cache
entries simply have different field names.
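Keying the cache by user name can be sketched as below. The prefixes and the
helper name are hypothetical (the real field names may differ); the point is
only that the user name becomes part of the PAM data field name, and that the
two caches use different prefixes.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical field-name prefixes, chosen for illustration only. */
#define GENERIC_RECORD_PREFIX "example-user-record-"
#define HOMED_RECORD_PREFIX "example-home-user-record-"

/* Build the name of a PAM data field for one user. Returns a newly
 * allocated string the caller must free(), or NULL if out of memory. */
static char *cache_field_name(const char *prefix, const char *username) {
        size_t n;
        char *name;

        assert(prefix);
        assert(username);

        n = strlen(prefix) + strlen(username) + 1;
        name = malloc(n);
        if (!name)
                return NULL;

        snprintf(name, n, "%s%s", prefix, username);
        return name;
}
```

The resulting string would then be used as the name passed to pam_set_data()
and pam_get_data(), so a lookup for one user can never return the record that
was cached for another.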
These arguments contain UserRecord structures serialized to JSON,
however only the "secret" part of it, not a whole user record. We do
this since the secret part is conceptually part of the user record and
in some contexts we need a user record in full with both secret and
non-secret part, and in others just the secret and in others just the
non-secret part, but we want to keep this in memory in the same logic.
Hence, let's rename the arguments where we expect a user record
consisting only of the secret part to "secret".
This also makes sure the control buffer is properly aligned. This
matters, as otherwise the control buffer might not be aligned and the
cmsg buffer counting might be off. The incorrect alignment is becoming
visible by using recvmsg_safe() as we suddenly notice the MSG_CTRUNC bit
set because of this.
That said, apparently this isn't enough to make this work on all
kernels. Since I couldn't figure this out, we now add 1K to the buffer
to be sure. We do this once already, also for a pktinfo structure
(though an IPv4/IPv6 one). I am puzzled by this, but this shouldn't
matter much. It works locally just fine, except for those Ubuntu CI
kernels...
While we are at it, make some other changes too, to simplify and
modernize the function.
Previously pam_systemd_home.so was relying on `PAM_PROMPT_ECHO_OFF` to
display error messages to the user and also display the next prompt.
`PAM_PROMPT_ECHO_OFF` was never meant as a way to convey information to
the user, and following the example set in pam_unix.so you can see that
it's meant to _only_ display the prompt. Details about why the
authentication failed should be done in a `PAM_ERROR_MSG` before
displaying a short prompt as per usual using `PAM_PROMPT_ECHO_OFF`.
[127/1355] Compiling C object 'src/shared/5afaae1@@systemd-shared-245@sta/ethtool-util.c.o'
../src/shared/ethtool-util.c: In function ‘ethtool_get_permanent_macaddr’:
../src/shared/ethtool-util.c:260:60: warning: array subscript 5 is outside the bounds of an interior zero-length array ‘__u8[0]’ {aka ‘unsigned char[]’} [-Wzero-length-bounds]
260 | ret->ether_addr_octet[i] = epaddr.addr.data[i];
| ~~~~~~~~~~~~~~~~^~~
In file included from ../src/shared/ethtool-util.c:5:
../src/shared/linux/ethtool.h:704:7: note: while referencing ‘data’
704 | __u8 data[0];
| ^~~~
../src/shared/ethtool-util.c: In function ‘ethtool_set_features’:
../src/shared/ethtool-util.c:488:31: warning: array subscript 0 is outside the bounds of an interior zero-length array ‘__u32[0]’ {aka ‘unsigned int[]’} [-Wzero-length-bounds]
488 | len = buffer.info.data[0];
| ~~~~~~~~~~~~~~~~^~~
In file included from ../src/shared/ethtool-util.c:5:
../src/shared/linux/ethtool.h:631:8: note: while referencing ‘data’
631 | __u32 data[0];
| ^~~~
The kernel should not define the length of the array, but it does. We can't fix
that, so let's use a cast to avoid the warning.
For https://github.com/systemd/systemd/issues/6119#issuecomment-626073743.
v2:
- use #pragma instead of a cast. It seems the cast only works in some cases, and
gcc is "smart" enough to see beyond the cast. Unfortunately clang does not support
this warning, so we need to do a config check whether to try to suppress.
Indicates that the tags list cannot be modified by the notify_message function,
since the tags list is created only once for multiple calls to
notify_message functions.
kernel 5.6 added support for a new flag for getrandom(): GRND_INSECURE.
If we set it we can get some random data out of the kernel random pool,
even if it is not yet initialized. This is great for us to initialize
hash table seeds and such, where it is OK if they are crap initially. We
used RDRAND for these cases so far, but RDRAND is only available on
newer CPUs and some archs. Let's now use GRND_INSECURE for these cases
as well, which means we won't needlessly delay boot anymore even on
archs/CPUs that do not have RDRAND.
Of course we never set this flag when generating crypto keys or uuids.
Which makes it different from RDRAND for us (and is the reason I think
we should keep explicit RDRAND support in): RDRAND we don't trust enough
for crypto keys. But we do trust it enough for UUIDs.
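A minimal sketch of the GRND_INSECURE usage described above, assuming glibc's
getrandom() wrapper from <sys/random.h>. The helper name and the fallback to
GRND_NONBLOCK on older kernels are assumptions made for illustration, not
necessarily what the commit does.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/random.h>
#include <sys/types.h>

#ifndef GRND_INSECURE
#define GRND_INSECURE 0x0004 /* value from the Linux 5.6 UAPI headers */
#endif

/* Hypothetical helper: fill a buffer with best-effort random bytes for
 * non-cryptographic uses such as hash table seeds. Returns 0 on success
 * or a negative errno-style value. Not suitable for keys or UUIDs. */
static int fill_seed_best_effort(void *buf, size_t size) {
        unsigned flags = GRND_INSECURE;
        unsigned char *p = buf;

        assert(buf || size == 0);

        while (size > 0) {
                ssize_t n = getrandom(p, size, flags);
                if (n < 0) {
                        if (errno == EINTR)
                                continue;
                        if (errno == EINVAL && flags == GRND_INSECURE) {
                                /* Kernels before 5.6 reject the unknown flag;
                                 * retry with a non-blocking request instead. */
                                flags = GRND_NONBLOCK;
                                continue;
                        }
                        return -errno;
                }
                p += n;
                size -= (size_t) n;
        }

        return 0;
}
```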
Let's make the logic a bit smarter: if we detect that /home is
encrypted, let's avoid double encryption and prefer plain
directory/subvolumes instead of our regular luks images.
Also, allow configuration of storage/file system via an env var passed
to homework. In a later commit, let's then change homed to initialize
that env var from a config file setting, when invoking homework.
Make use of the new user_record_build_image_path() helper the previous
commit added to share some code.
Also, let's make sure we update all parsed-out fields with the new data
from the binding, so that the parsed-out fields are definitely
up-to-date.
We should return 0 only if current freezer state, as reported by the
kernel, is already the desired state. Otherwise, we would dispatch
return dbus message prematurely in bus_unit_method_freezer_generic().
Thanks to Frantisek Sumsal for reporting the issue.
As described in #15603, it is a fairly common setup to use a fqdn as the
configured hostname. But it is often convenient to use just the actual
hostname, i.e. until the first dot. This adds support in tmpfiles, sysusers,
and unit files for %l which expands to that.
Fixes #15603.
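The %l expansion described above amounts to cutting the configured hostname at
the first dot. A minimal sketch, with a hypothetical helper name:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a newly allocated copy of the hostname up to
 * (not including) the first dot, or NULL if out of memory. A hostname
 * without a dot is returned unchanged. */
static char *short_hostname(const char *hostname) {
        size_t n;
        char *s;

        assert(hostname);

        n = strcspn(hostname, ".");
        s = malloc(n + 1);
        if (!s)
                return NULL;

        memcpy(s, hostname, n);
        s[n] = '\0';
        return s;
}
```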
Ideally, assert_cc() would be used for this, so that it is not possible to even
compile systemd with something like '-Dfallback-hostname=.foo'. But to do a
proper check we need to call hostname_is_valid(), and we cannot depend on being
able to run code (e.g. during cross-compilation). So let's do a very superficial
check in meson, and a proper one in test-util.
This new helper checks whether the specified locale is installed. It's
distinct from locale_is_valid() which just superficially checks if a
string looks like something that could be a valid locale.
Heavily inspired by @jsynacek's #13964.
Replaces: #13964
There are two libc APIs for accessing the user database: NSS/getpwuid(),
and fgetpwent(). If we run in --root= mode (i.e. "offline" mode), let's
use the latter. Otherwise the former. This means tmpfiles can use the
database included in the root environment for chowning, which is a lot
more appropriate.
Fixes: #14806
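For illustration, an "offline" lookup with fgetpwent() could look like the
sketch below. The helper name is hypothetical, and the caller is assumed to
have already built the path to the passwd file inside the root directory.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <pwd.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: look up a user's UID in a specific passwd file
 * (for example <root>/etc/passwd) instead of going through NSS.
 * Returns 0 and sets *ret_uid on success, -ESRCH if the user is not
 * listed, or another negative errno-style value if the file can't be
 * opened. */
static int lookup_uid_in_passwd_file(const char *path, const char *name, uid_t *ret_uid) {
        struct passwd *pw;
        FILE *f;
        int r = -ESRCH;

        assert(path);
        assert(name);
        assert(ret_uid);

        f = fopen(path, "r");
        if (!f)
                return -errno;

        /* fgetpwent() parses one passwd(5) line per call from the stream. */
        while ((pw = fgetpwent(f))) {
                if (strcmp(pw->pw_name, name) == 0) {
                        *ret_uid = pw->pw_uid;
                        r = 0;
                        break;
                }
        }

        fclose(f);
        return r;
}
```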
We make this entirely independent of the regular discard field, i.e. the
one that controls discard behaviour when the home directory is online.
Not all combinations make a ridiculous amount of sense, but most do.
Specifically:
online-discard = yes, offline-discard = yes
→ Discard when activating explicitly, and during runtime using
the "discard" mount option, and discard explicitly when logging
out again.
online-discard = no, offline-discard = yes
→ The new default: when logging in allocate the full backing
    store, and use no discard while active. When logging out discard
everything. This provides nice behaviour: we take minimal storage
when offline but provide allocation guarantees while online.
online-discard = no, offline-discard = no
→ Never, ever discard, always operate with fully allocated
backing store. The extra safe mode.
Let's make debugging a bit easier: when invoking homed from the build
tree it's now possible to make sure homed invokes the build tree's
homework binary by setting an env var.
We always need to make them unions with a "struct cmsghdr" in them, so
that things are properly aligned. Otherwise we might end up at an unaligned
address and the counting goes all wrong, possibly making the kernel
refuse our buffers.
Also, let's make sure we initialize the control buffers to zero when
sending, but leave them uninitialized when reading.
Both the alignment and the initialization thing are mentioned in the
cmsg(3) man page.
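A sketch of the pattern, passing a single file descriptor over an AF_UNIX
socket. The function names are hypothetical; the union, the zeroing on the
sending side and the use of CMSG_SPACE()/CMSG_LEN() follow cmsg(3).

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* The union makes sure the buffer is suitably aligned for struct cmsghdr. */
union fd_control_buffer {
        struct cmsghdr cmsghdr;
        unsigned char buf[CMSG_SPACE(sizeof(int))];
};

/* Hypothetical helper: send one fd. Returns 0 or a negative errno value. */
static int send_one_fd(int sock, int fd) {
        union fd_control_buffer control;
        char dummy = 0;
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        struct msghdr mh = {
                .msg_iov = &iov,
                .msg_iovlen = 1,
                .msg_control = &control,
                .msg_controllen = sizeof(control),
        };
        struct cmsghdr *cmsg;

        /* Zero the control buffer because we are sending it. */
        memset(&control, 0, sizeof(control));

        cmsg = CMSG_FIRSTHDR(&mh);
        assert(cmsg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int)); /* initialized part only */
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

        return sendmsg(sock, &mh, MSG_NOSIGNAL) < 0 ? -errno : 0;
}

/* Hypothetical helper: receive one fd. Returns the fd or a negative errno value. */
static int receive_one_fd(int sock) {
        union fd_control_buffer control; /* uninitialized: the kernel fills it in */
        char dummy;
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        struct msghdr mh = {
                .msg_iov = &iov,
                .msg_iovlen = 1,
                .msg_control = &control,
                .msg_controllen = sizeof(control),
        };
        struct cmsghdr *cmsg;
        int fd;

        if (recvmsg(sock, &mh, MSG_CMSG_CLOEXEC) < 0)
                return -errno;
        if (mh.msg_flags & MSG_CTRUNC)
                return -EIO; /* control data did not fit into our buffer */

        for (cmsg = CMSG_FIRSTHDR(&mh); cmsg; cmsg = CMSG_NXTHDR(&mh, cmsg))
                if (cmsg->cmsg_level == SOL_SOCKET &&
                    cmsg->cmsg_type == SCM_RIGHTS &&
                    cmsg->cmsg_len == CMSG_LEN(sizeof(int))) {
                        memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
                        return fd;
                }

        return -ENOENT;
}
```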
We need to use the CMSG_SPACE() macro to size the control buffers, not
CMSG_LEN(). The former is rounded up to next alignment boundary, the
latter is not. The former should be used for allocations, the latter for
encoding how much of it is actually initialized. See cmsg(3) man page
for details about this.
Given how confusing this is, I guess we don't have to be too ashamed
here, in most cases we actually did get this right.
Apparently unpriv clients expect to be able to auth via PAM. Kinda
sucks. But it is what it is. Hence open this up.
This shouldn't be too bad in effect since clients after all need to
provide security creds for unlocking the home dir, in order to misuse
this.
Fixes: #15072
Our hashmap and set helpers return a different code whenever an entry
already exists, so let's use this to avoid unsetting scan_uptodate when
not necessary.
Thus, the return convention for
sd_device_enumerator_add_match_subsystem,
sd_device_enumerator_add_match_sysattr,
sd_device_enumerator_add_match_property,
sd_device_enumerator_add_match_sysname,
sd_device_enumerator_add_match_tag,
device_enumerator_add_match_parent_incremental,
sd_device_enumerator_add_match_parent,
sd_device_enumerator_allow_uninitialized,
device_enumerator_add_match_is_initialized
is that "1" is returned if action was taken, and "0" on noop.
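The convention can be illustrated with a heavily simplified stand-in. The
struct and function below are hypothetical and use a fixed-size array instead
of the real set helpers; only the return values and the handling of
scan_uptodate mirror the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_MATCHES 8

/* Hypothetical, simplified stand-in for an enumerator with one match list. */
struct enumerator_sketch {
        const char *subsystems[MAX_MATCHES];
        size_t n_subsystems;
        bool scan_uptodate;
};

/* Returns 1 if the match was added, 0 if it was already present (no-op),
 * and -1 if the fixed-size table of this sketch is full. */
static int add_match_subsystem(struct enumerator_sketch *e, const char *subsystem) {
        assert(e);
        assert(subsystem);

        for (size_t i = 0; i < e->n_subsystems; i++)
                if (strcmp(e->subsystems[i], subsystem) == 0)
                        return 0; /* nothing changed, earlier scan results stay valid */

        if (e->n_subsystems >= MAX_MATCHES)
                return -1;

        e->subsystems[e->n_subsystems++] = subsystem;
        e->scan_uptodate = false; /* invalidate only when something really changed */
        return 1;
}
```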
The sets are such basic functionality that it is convenient to be able to
build test-set without all the machinery in shared, and to test it without
the mempool to validate memory accesses easier.
If we're using a set with _put_strdup(), most of the time we want to use
string hash ops on the set, and free the strings when done. This defines
a new string_hash_ops_free structure to automatically free
the keys when removing the set, and makes set_put_strdup() and set_put_strdupv()
instantiate the set with those hash ops.
hashmap_put_strdup() was already doing something similar.
(It is OK to instantiate the set earlier, possibly with a different hash ops
structure. set_put_strdup() will then use the existing set. It is also OK
to call set_free_free() instead of set_free() on a set with
string_hash_ops_free, the effect is the same, we're just overriding the
override of the cleanup function.)
No functional change intended.
Let's create a string cell for the unit if possible (since there can
only be one unit right now, and the JSON alternative output then
generates a string instead of an array for us), an empty cell if empty.
This adds the --exit-idle-time argument that causes
systemd-socket-proxyd to exit when there has been an idle period. An
open connection prevents the idle period from starting, even if there is
no activity on that connection.
When combined with another service that uses StopWhenUnneeded=, the
proxy exiting can trigger a resource-intensive process to exit. So
although the proxy may consume minimal resources, significant resources
can be saved indirectly.
Fixes #2106
Let's use "!*" instead of "!!" as invalid password string.
Generally, any invalid password string can be used for locking an
account, according to shadow(5). To temporarily lock a password of an
account it is commonly implemented to prefix the original password with
a single "!", so that it can later on be unlocked again by removing the
"!", restoring the original password. Thus, the "!" marker is an
indicator for a locked password; the act of prefixing "!" to a
password string is the locking operation; and the removal of a "!"
prefix is the unlock operation. (This is also suggested in shadow(5)).
If we want to entirely lock an account we previously used "!!" as
password string. This is nice since it indicates the password is locked.
However, it is less than ideal, since applying the password unlock
operation once will change the string to "!", which is still a locked
password. Unlocking the password a second time will result in "", i.e.
the empty password, which will in many cases allow logging in without
password. And that's a problem. Hopefully, tools do not allow such
duplicate unlocking, but it's still not a nice property.
By changing our password string to "!*" we get different behaviour: the
password will appear locked. When it is unlocked the password is "*"
which is an invalid password. In that case the password is hence
unlocked but invalid, which is a much better state to be in than the
above.
This is paranoia hardening. Not more. There's no report that anyone
ever unlocked an account twice and people could log in.
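The difference between the two strings can be made concrete with a small model
of the unlock operation (strip one leading "!"). The helper name is
hypothetical and this models the convention described above, not any
particular tool.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical model of the common "unlock" operation: strip one leading
 * "!" if present. Returns a pointer into the original string. */
static const char *unlock_once(const char *password_field) {
        assert(password_field);
        return password_field[0] == '!' ? password_field + 1 : password_field;
}
```

Applying unlock_once() twice to "!!" yields the empty string, while applying it
any number of times to "!*" never gets past "*", which is still an invalid
password.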
In all the other cases, I think the code was clearer with the static table.
Here, not so much. And because of the existing dump code, the vtables cannot
be made static and need to remain exported. I still think it's worth doing the
change to have the cmdline introspection, but I'm disappointed with how this
came out.
The idea is to have a static table that defines the dbus API. The vtable is
defined right next to the interface name and path because they are logically
connected.
For units which are aliases of other units, reporting preset status as
"enabled" is rather misleading. For example, dbus.service is an alias of
dbus-broker.service. In list-unit-files we'd show both as "enabled". In
particular, systemctl preset ignores aliases, so showing any preset status at
all is always going to be misleading. Let's introduce a new state "alias" and
use that for all aliases.
I was trying to avoid adding a new state, to keep compatibility with previous
behaviour, but for alias unit files it simply doesn't seem very useful to show
any of the existing states. It seems that clearly showing that those are
aliases for other units will be easiest to understand for users.
When doing list-unit-files with --root, we would re-read the preset
list for every unit. This uses a cache to only do it once. The time
for list-unit-files goes down by about 30%.
unit_file_query_preset() is also called from src/core/. This patch does not
touch that path, since the savings there are smaller, since preset status is
only read on demand over dbus, and caching would be more complicated.
It is super confusing when a command does not support --root, and is called
with it specified, and returns some bogus results. Let's just catch this
early and refuse.
Consider such configuration:
$ systemd-nspawn --read-only --timezone=copy --resolv-conf=copy-host \
--overlay="+/etc::/etc" <...>
Assuming one wants `/` to be read-only, DNS and `/etc/localtime` to
work. One way to do it is to create an overlay filesystem in `/etc/`.
However, systemd-nspawn tries to create `/etc/resolv.conf` and
`/etc/localtime` before mounting the custom paths, while `/` (and, by
extension, `/etc`) is read-only. Thus it fails to create those files.
Mounting custom paths before modifying anything in `/etc/` makes this
possible.
Full example:
```
$ debootstrap buster /var/lib/machines/t1 http://deb.debian.org/debian
$ systemd-nspawn --private-users=false --timezone=copy --resolv-conf=copy-host --read-only --tmpfs=/var --tmpfs=/run --overlay="+/etc::/etc" -D /var/lib/machines/t1 ping -c 1 example.com
Spawning container t1 on /var/lib/machines/t1.
Press ^] three times within 1s to kill container.
ping: example.com: Temporary failure in name resolution
Container t1 failed with error code 130.
```
With the patch:
```
$ sudo ./build/systemd-nspawn --private-users=false --timezone=copy --resolv-conf=copy-host --read-only --tmpfs=/var --tmpfs=/run --overlay="+/etc::/etc" -D /var/lib/machines/t1 ping -qc 1 example.com
Spawning container t1 on /var/lib/machines/t1.
Press ^] three times within 1s to kill container.
PING example.com (93.184.216.34) 56(84) bytes of data.
--- example.org ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 110.912/110.912/110.912/0.000 ms
Container t1 exited successfully.
```
This changes the calendarspec parser to allow expressions such as
"00:05..05", i.e. a range where start and end are the same. It also
allows expressions such as "00:1-2/3", i.e. where the repetition value
does not fit even once in the specified range. With this patch both
cases will now be optimized away, i.e. the range is removed and a fixed
value is used, which is functionally equivalent.
See #15030 for an issue where the inability to parse such expressions
caused confusion.
I think it's probably better to accept these gracefully and optimize
them away instead of refusing them with a plain EINVAL. With a tool such
as "systemd-analyze calendar" it should be easy to figure out the
normalized form with the redundant bits optimized away.
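The optimization can be sketched on a simplified model of one calendar
component. The struct, its field conventions and the function name are
assumptions made for illustration; the real parser's data structures differ.

```c
#include <assert.h>

/* Hypothetical, simplified model of one calendar component:
 * stop < 0 means "no range", repeat == 0 means "no repetition". */
struct component_sketch {
        int start;
        int stop;
        int repeat;
};

/* Collapse ranges that can only ever match their start value into a
 * fixed value, e.g. "05..05" or "1..2/3". */
static void normalize_component(struct component_sketch *c) {
        assert(c);

        if (c->stop < 0)
                return; /* already a fixed value or an open-ended repetition */

        if (c->stop == c->start ||
            (c->repeat > 0 && c->start + c->repeat > c->stop)) {
                c->stop = -1;
                c->repeat = 0;
        }
}
```

A range such as 1..10/3 is left alone, since the repetition fits into the range
at least once.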
Let's try to mangle table contents a bit to make them more suitable as
JSON field names. Specifically when we see "foo bar" convert this to
"foo_bar" as field name, as variable/field names are generally assumed
to be without spaces.
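A minimal sketch of that mangling, with a hypothetical helper name. Only the
space-to-underscore conversion mentioned above is shown.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a newly allocated copy of a table header
 * with spaces replaced by underscores, or NULL if out of memory. */
static char *mangle_json_field_name(const char *header) {
        size_t n;
        char *name;

        assert(header);

        n = strlen(header);
        name = malloc(n + 1);
        if (!name)
                return NULL;

        for (size_t i = 0; i < n; i++)
                name[i] = header[i] == ' ' ? '_' : header[i];
        name[n] = '\0';

        return name;
}
```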
HUGE_SIZE was defined inconsistently.
> In file included from ../src/basic/alloc-util.h:9,
> from ../src/journal/test-compress.c:9:
> ../src/journal/test-compress.c: In function ‘main’:
> ../src/journal/test-compress.c:280:33: error: ‘HUGE_SIZE’ undeclared (first use in this function)
> 280 | assert_se(huge = malloc(HUGE_SIZE));
Systems where a mount point is expected to be read-write need a way to
fail mount units that fall back to read-only.
Add a property to allow setting the -w option when calling mount(8).
This adds the sd_notify_barrier function, to allow users to synchronize against
the reception of sd_notify(3) status messages. It acts as a synchronization
point, and a successful return guarantees that all previous messages have been
consumed by the manager. This can be used to eliminate race conditions where
the sending process exits too early for systemd to associate its PID to a
cgroup and attribute the status message to a unit correctly.
systemd-notify now uses this function for proper notification delivery, making it
useful for NotifyAccess=all units again in user mode, or in cases where it
doesn't have a control process as parent.
Fixes: #2739
A service can specify FDSTORE=1 FDPOLL=0 to request that PID1 does not
poll the fds to remove them on error. If set, fds will only be removed on
FDSTOREREMOVE=1 or when the service is done.
Fixes: #12086
With cgroup v2 the cgroup freezer is implemented as a cgroup
attribute called cgroup.freeze. cgroup can be frozen by writing "1"
to the file and kernel will send us a notification through
"cgroup.events" after the operation is finished and processes in the
cgroup entered quiescent state, i.e. they are not scheduled to
run. Writing "0" to the attribute file does the inverse and process
execution is resumed.
This commit exposes above low-level functionality through systemd's DBus
API. Each unit type must provide specialized implementation for these
methods, otherwise, we return an error. So far only service, scope, and
slice unit types provide the support. It is possible to check if a
given unit has the support using CanFreeze() DBus property.
Note that DBus API has a synchronous behavior and we dispatch the reply
to freeze/thaw requests only after the kernel has notified us that
requested operation was completed.
$ systemctl --no-pager --root /tmp/root2/ cat ctrl-alt-del.target
Failed to resolve symlink /tmp/root2/etc/systemd/system/ctrl-alt-del.target pointing to /usr/lib/systemd/system/reboot.target, ignoring: Channel number out of range
...
EFI variable access is nowadays subject to rate limiting by the kernel.
Thus, let's cache the results of checking them, in order to minimize how
often we access them.
Fixes: #14828
Callers of cg_get_keyed_attribute_full() can now specify via the flag whether the
missing keys in the cgroup attribute file are OK or not. Wrappers for both the
strict and the graceful version are also provided.
This has the advantage that mac_selinux_access_check() can be used as a
function in all contexts. For example, parameters passed to it won't be
reported as unused if the "function" call is replaced with 0 on SELinux
disabled builds.
If we parse an xattr line that has no valid assignment, we might end up
with an empty ->xattr list. Don't hit assert on that, just go on.
Fixes: #15610
Let's allow more memory to be locked on beefy machines than on small
ones. The previous limit of 64M is the lower bound still. This
effectively means on a 4GB machine we can lock 512M, which should be
more than enough, but still not lock up the machine entirely under
pressure.
Fixes: #15053
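The scaling can be sketched as below. The 64M floor is from the text above; the
1/8 ratio is an assumption inferred from the "4GB machine, 512M" example and is
not stated explicitly, and the helper name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* The previous fixed limit, kept as the lower bound. */
#define MEMLOCK_FLOOR_BYTES (UINT64_C(64) * 1024 * 1024)

static_assert(MEMLOCK_FLOOR_BYTES == UINT64_C(67108864), "floor is 64M");

/* Hypothetical helper: scale the limit with the amount of RAM.
 * Assumption: 1/8 of RAM, inferred from the 4GB -> 512M example. */
static uint64_t memlock_limit_for_ram(uint64_t ram_bytes) {
        uint64_t scaled = ram_bytes / 8;

        return scaled > MEMLOCK_FLOOR_BYTES ? scaled : MEMLOCK_FLOOR_BYTES;
}
```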
WSL2 will soon (TM) include the "WSL2" string in /proc/sys/kernel/osrelease
so the workaround will no longer be necessary.
We have several different cloud images which do include the "microsoft"
string already, which would break this detection. They are for internal
usage at the moment, but the userspace side can come from all over the
place so it would be quite hard to track and downstream-patch to avoid
breakages.
This reverts commit a2f838d590.
On my laptop (Lenovo X1 Carbon 4th) I very occasionally see test-boot-timestamps
fail with this tb:
262/494 test-boot-timestamps FAIL 0.7348453998565674 s (killed by signal 6 SIGABRT)
08:12:48 SYSTEMD_LANGUAGE_FALLBACK_MAP='/home/zbyszek/src/systemd/src/locale/language-fallback-map' SYSTEMD_KBD_MODEL_MAP='/home/zbyszek/src/systemd/src/locale/kbd-model-map' PATH='/home/zbyszek/src/systemd/build:/home/zbyszek/.local/bin:/usr/lib64/qt-3.3/bin:/usr/share/Modules/bin:/usr/condabin:/usr/lib64/ccache:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/home/zbyszek/bin:/var/lib/snapd/snap/bin' /home/zbyszek/src/systemd/build/test-boot-timestamps
--- stderr ---
Failed to read $container of PID 1, ignoring: Permission denied
Found container virtualization none.
Failed to get SystemdOptions EFI variable, ignoring: Interrupted system call
Failed to read ACPI FPDT: Permission denied
Failed to read LoaderTimeInitUSec: Interrupted system call
Failed to read EFI loader data: Interrupted system call
Assertion 'q >= 0' failed at src/test/test-boot-timestamps.c:84, function main(). Aborting.
Normally it takes ~0.02s, but here there's a slowdown to 0.73 and things fail with EINTR.
This happens only occasionally, and I haven't been able to capture a strace.
It would be possible to ignore that case in test-boot-timestamps or always translate
EINTR to -ENODATA. Nevertheless, I think it's better to retry, since this gives
us more resilient behaviour and avoids a transient failure.
See
https://github.com/torvalds/linux/blob/master/fs/efivarfs/file.c#L75
and
bef3efbeb8.
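A generic sketch of retrying a read that fails with EINTR. The helper name and
the retry bound are assumptions; the actual change may retry differently, for
example with a delay between attempts.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_EINTR_RETRIES 5 /* hypothetical bound, chosen for illustration */

/* Hypothetical helper: read() that retries a bounded number of times when
 * interrupted. Returns the number of bytes read or a negative errno value. */
static ssize_t read_retry_eintr(int fd, void *buf, size_t size) {
        assert(buf);

        for (unsigned attempt = 0;; attempt++) {
                ssize_t n = read(fd, buf, size);
                if (n >= 0)
                        return n;

                /* Give up on any other error, or once the bound is reached. */
                if (errno != EINTR || attempt >= MAX_EINTR_RETRIES)
                        return -errno;
        }
}
```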
Don't assume that 4MB can be allocated from stack since there could be smaller
DefaultLimitSTACK= in force, so let's use malloc(). NUL terminate the huge
strings by hand, also ensure termination in test_lz4_decompress_partial() and
optimize the memset() for the string.
Some items in /proc and /etc may not be accessible to poor unprivileged users
due to e.g. SELinux, BOFH or both, so check for EACCES and EPERM.
/var/tmp may be a symlink to /tmp and then path_compare() will always fail, so
let's stick to /tmp like elsewhere.
/tmp may be mounted with noexec option and then trying to execute scripts from
there would fail.
Detect and warn if seccomp is already in use, which could make seccomp test
fail if the syscalls are already blocked.
Unset $TMPDIR so it will not break specifier tests where %T is assumed to be
/tmp and %V /var/tmp.
It's not always mounted, e.g. during the build-time tests, it's running inside
a chroot (that's how Debian/Ubuntu build packages, in chroots) so this test
always fails because /sys/fs/cgroup isn't mounted.
When nothing at all is mounted at /sys/fs/cgroup, the fs.f_type is
SYSFS_MAGIC (0x62656572) which results in the confusing debug log:
"Unknown filesystem type 62656572 mounted on /sys/fs/cgroup."
Instead, if the f_type is SYSFS_MAGIC, a more accurate message is:
"No filesystem is currently mounted on /sys/fs/cgroup."
Split out of #15457, let's see if this is the culprit of the CI failure.
(also setting green label here, since @keszybz already greenlit it in that other PR)
We unregister binfmt_misc twice during shutdown with this change:
1. A previous commit added support for doing that in the final shutdown
phase, i.e. when we do the aggressive umount loop. This is the robust
thing to do, in case the earlier ("clean") shutdown phase didn't work
for some reason.
2. This commit adds support for doing that when systemd-binfmt.service
is stopped. This is a good idea so that people can order mounts
before the service if they want to register binaries from such
mounts, as in that case we'll undo the registration on shutdown
again, before unmounting those mounts.
And all that, just because of that weird "F" flag the kernel introduced
that can pin files...
Fixes: #14981
Let's just copy out the bit of the string we need, and let's make sure
we refuse rules called "status" and "register", since those are special
files in binfmt_misc's file system.
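The name check can be sketched as below. The helper name is hypothetical, and
rejecting the empty string is an extra sanity check added for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: refuse rule names that collide with the special
 * "status" and "register" files of the binfmt_misc file system. */
static bool binfmt_rule_name_is_allowed(const char *name) {
        assert(name);

        if (name[0] == '\0')
                return false; /* extra check, not mentioned in the text above */

        return strcmp(name, "status") != 0 &&
               strcmp(name, "register") != 0;
}
```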
Apparently if the new "F" flag is used they might pin files, which
blocks us from unmounting things. Let's hence clear this up explicitly,
before entering our umount loop.
Fixes: #14981
Let's return ENOSYS in that case, to make things a bit less confusing.
Previously we'd just propagate ENOENT, which people might mistake as
applying to the object being modified rather than /proc/ just not being
there.
Let's return ENOSYS instead, i.e. an error clearly indicating that some
kernel API is not available. This hopefully should put people on a
better track.
Note that we only do the procfs check in the error path, which hopefully
means it's the less likely path.
We probably can add similar bits to more suitable codepaths dealing with
/proc/self/fd, but for now, let's stick to the ones noticed in #14745.
Fixes: #14745
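The error translation can be sketched as follows. The helper name is
hypothetical, and the path is a parameter only so the sketch can be exercised
without unmounting /proc; a real caller would pass "/proc".

```c
#include <assert.h>
#include <errno.h>
#include <linux/magic.h>
#include <sys/vfs.h>

/* Hypothetical helper, meant for the error path only: if an operation
 * going through /proc/self/fd failed with ENOENT, check whether procfs is
 * mounted at proc_path at all, and return -ENOSYS if it is not. Any other
 * error is passed through unchanged. */
static int fix_error_if_proc_missing(int r, const char *proc_path) {
        struct statfs sfs;

        assert(proc_path);

        if (r != -ENOENT)
                return r;

        if (statfs(proc_path, &sfs) < 0)
                return errno == ENOENT ? -ENOSYS : r;

        if ((unsigned long) sfs.f_type != PROC_SUPER_MAGIC)
                return -ENOSYS; /* something else (or nothing) is mounted there */

        return r; /* procfs is there, so ENOENT refers to the object itself */
}
```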
Our journal code is generally supposed to be written in a fashion that
the underlying file can be deallocated any time, i.e. our mmap of it
suddenly becomes all zeroes. The idea is that we catch that when parsing
everything. For that to work safely we need to make sure that when doing
arithmetic or comparisons on values read from the map we don't run into
TOCTTOU issues when determining validity. Hence we need to copy out the
values before use and operate on the copies. This requires some special
care since the C compiler could suppress our copies as optimization.
Hence use the new READ_NOW() macro to force a copy by using memcpy(),
and use it whenever we start doing an arithmetic operation on it, or
validity checking of multiple steps.
Fixes: #14943
Mappings can be replaced by all zeroes under our feet if vacuuming
decides to unallocate some file. Hence let's not check for this kind of
stuff in an assert.
(Typically, we should generate runtime errors in this case, in
particular EBADMSG, which the callers generally look for. But in this
case this is just an extra precaution check anyway, so let's just remove
it.)
When accessing journal files we generally are fine when values change
beneath our feet, while we are looking at them, as long as they change
from something valid to zero. This is required since we nowadays
forcibly unallocate journal files on vacuuming, to ensure they are
actually released.
However, we need to make sure that the validity checks we enforce are
done on suitable copies of the fields in the file. Thus provide a macro
that forces a copy, and disallows the compiler from merging our copy
with the actual memory it comes from.
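One way such a macro could look is sketched below. This is an assumption for
illustration (GNU C extensions, memcpy() plus a compiler barrier), not
necessarily the definition the commit adds, and the struct and bounds check
are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch of a "read now" macro: copy the field with memcpy() and add a
 * compiler barrier, so that later checks operate on the copy instead of
 * re-reading the mapped memory. The field must be a non-const lvalue. */
#define READ_NOW_SKETCH(field)                                   \
        __extension__ ({                                         \
                __typeof__(field) _copy;                         \
                memcpy(&_copy, &(field), sizeof(_copy));         \
                __asm__ __volatile__ ("" : : : "memory");        \
                _copy;                                           \
        })

/* Hypothetical on-disk object header living in a memory map. */
struct entry_sketch {
        uint64_t offset;
        uint64_t size;
};

/* Validate offset and size against the file size using stable copies, so
 * that both steps of the check see the same values. Returns 1 if the
 * entry lies within the file, 0 otherwise. */
static int entry_in_bounds(struct entry_sketch *e, uint64_t file_size) {
        uint64_t offset, size;

        assert(e);

        offset = READ_NOW_SKETCH(e->offset);
        size = READ_NOW_SKETCH(e->size);

        if (offset > file_size)
                return 0;

        return size <= file_size - offset;
}
```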
Otherwise we'd not read the service's input while waiting for the job to
finish, and there's no point in waiting for the job anyway if we wait for
the unit to stop ultimately.
Fixes: #15395