Several recent failed runs show that the test is still racy in two ways:
1) Sometimes it takes a while before the PID file is created, leading
to:
```
[ 10.950540] testsuite-47.sh[308]: ++ cat /leakedtestpid
[ 10.959712] testsuite-47.sh[308]: cat: /leakedtestpid: No such file or directory
[ 10.959824] testsuite-47.sh[298]: + leaked_pid=
```
2) Again, sometimes we check the leaked PID before the unit is actually
stopped, leading to a false negative:
```
[ 18.099599] testsuite-47.sh[346]: ++ cat /leakedtestpid
[ 18.116462] testsuite-47.sh[333]: + leaked_pid=342
[ 18.117101] testsuite-47.sh[333]: + systemctl stop testsuite-47-repro
...
[ 20.033907] testsuite-47.sh[333]: + ps -p 342
[ 20.080050] testsuite-47.sh[351]: PID TTY TIME CMD
[ 20.080050] testsuite-47.sh[351]: 342 ? 00:00:00 sleep
[ 20.082040] testsuite-47.sh[333]: + exit 42
```
When a command asks to load a unit directly and it is in state
UNIT_NOT_FOUND, and the cache is outdated, we refresh it and
attempto to load again.
Use the same logic when building up a transaction and a dependency in
UNIT_NOT_FOUND state is encountered.
Update the unit test to exercise this code path.
SIG-prefixed signals for `kill` are not POSIX compliant, so on Ubuntu CI
(which defaults to dash instead of bash) the TEST-52 contains following
error:
[ 9693.549638] sh[51]: + systemctl poweroff --no-block
[ 9693.553130] systemd-logind[26]: System is powering down.
[ 9693.608911] sh[54]: /bin/sh: 1: kill: Illegal option -S
This can be reproduced manually as well, either by running dash, or bash
in POSIX mode:
$ dash -c 'kill -SIGKILL 123'
dash: 1: kill: Illegal option -S
$ bash --posix -c 'kill -SIGKILL 123'
bash: line 0: kill: SIGKILL: invalid signal specification
When the system is under heavy load, it can happen that the unit cache
is refreshed for an unrelated reason (in the test I simulate this by
attempting to start a non-existing unit). The new unit is found and
accounted for in the cache, but it's ignored since we are loading
something else.
When we actually look for it, by attempting to start it, the cache is
up to date so no refresh happens, and starting fails although we have
it loaded in the cache.
When the unit state is set to UNIT_NOT_FOUND, mark the timestamp in
u->fragment_loadtime. Then when attempting to load again we can check
both if the cache itself needs a refresh, OR if it was refreshed AFTER
the last failed attempt that resulted in the state being
UNIT_NOT_FOUND.
Update the test so that this issue reproduces more often.
Prompted by systemd/systemd#16111.
* check if /var is a mountpoint - if not, something went wrong. In case
of systemd/systemd#16111 the /failed file was created, because
systemd-cryptsetup failed, but it ended up being empty, making the result
check incorrectly pass
* forward journal messages to console - if we fail to mount /var,
journald won't flush logs to the persistent storage and we end up
empty handed and with no clue what went wrong
For example, without systemd/systemd#16111 and with this patch:
...
[FAILED] Failed to start systemd-cryptsetup@varcrypt.service.
See 'systemctl status systemd-cryptsetup@varcrypt.service' for details.
[DEPEND] Dependency failed for cryptsetup.target.
...
[ 3.882451] systemd-cryptsetup[581]: Key file /etc/varkey is world-readable. This is not a good idea!
[ 3.883946] systemd-cryptsetup[581]: WARNING: Locking directory /run/cryptsetup is missing!
[ 3.884846] systemd-cryptsetup[581]: Failed to load Bitlocker superblock on device /dev/disk/by-uuid/180ba5ef-873b-4018-9968-47c23431f71a: Invalid argument
...
[ 4.099451] sh[606]: + mountpoint /var
[ 4.100025] sh[603]: + systemctl poweroff --no-block
[ 4.101636] systemd[1]: Finished systemd-user-sessions.service.
[ 4.102598] sh[608]: /var is not a mountpoint
[FAILED] Failed to start testsuite-02.service.
dm-verity support in dissect-image at the moment is restricted to GPT
volumes.
If the image a single-filesystem type without a partition table (eg: squashfs)
and a roothash/verity file are passed, set the verity flag and mark as
read-only.
Note that run_test() calls coredumpctl in a loop because in certain
environments (1 vCPU unaccelerated QEMU VM) it might take quite a
while to process the coredump.
The time-based cache allows starting a new unit without an expensive
daemon-reload, unless there was already a reference to it because of
a dependency or ordering from another unit.
If the cache is out of date, check again if we can load the
fragment.
Check:
- There is only 3 messages logged with type stdout
- Check all messages logged does not have new line: LINE_BREAK=eof
- Check that the 3 messages are logged from a different PID
- Check the 3 MESSAGE= content
The disk attributes can take some time to update on certain filesystems,
so let's strip them from inputs of both `homectl` and `userdbctl` before
comparing them to avoid unexpected fails.
Also, switch from `cmp` to `diff` to make a potential test fail a bit more
debuggable.
Fixes: #14755
Give systemd a chance to process the stop event before checking if the
PID has indeed leaked. This should fix the intermittent test fails in CI
even with a fixed systemd version, like this one:
```
Apr 08 10:22:09 testsuite-47.sh[345]: ++ cat /leakedtestpid
Apr 08 10:22:09 testsuite-47.sh[334]: + leaked_pid=342
Apr 08 10:22:09 testsuite-47.sh[334]: + systemctl stop testsuite-47-repro
Apr 08 10:22:10 testsuite-47.sh[334]: + ps -p 342
Apr 08 10:22:10 testsuite-47.sh[348]: PID TTY TIME CMD
Apr 08 10:22:10 testsuite-47.sh[348]: 342 ? 00:00:00 sleep
Apr 08 10:22:10 testsuite-47.sh[334]: + exit 42
```
Followup to 197298ff9f
The test would fail when run again from the same image. So let's
rename the stuff we create to be more unique, and remove it before
running the test. (Removing it after would be more elegant, but it's
hard to make sure that everything is removed when things fail halfway.
Cleanup *before* tests is much more rebust.)
The two timezone files are now installed in the global setup. I am not too
happy about this, but it still seems better than to create a completely
separate image just for this.