When doing list-unit-files with --root, we would re-read the preset
list for every unit. This uses a cache to only do it once. The time
for list-unit-files goes down by about ~30%.
unit_file_query_preset() is also called from src/core/. This patch does not
touch that path, since the saving there are smaller, since preset status is
only read on demand over dbus, and caching would be more complicated.
It is super confusing when a command does not support --root, and is called
with it specified, and returns some bogus results. Let's just catch this
early and refuse.
Consider such configuration:
$ systemd-nspawn --read-only --timezone=copy --resolv-conf=copy-host \
--overlay="+/etc::/etc" <...>
Assuming one wants `/` to be read-only, DNS and `/etc/localtime` to
work. One way to do it is to create an overlay filesystem in `/etc/`.
However, systemd-nspawn tries to create `/etc/resolv.conf` and
`/etc/localtime` before mounting the custom paths, while `/` (and, by
extension, `/etc`) is read-only. Thus it fails to create those files.
Mounting custom paths before modifying anything in `/etc/` makes this
possible.
Full example:
```
$ debootstrap buster /var/lib/machines/t1 http://deb.debian.org/debian
$ systemd-nspawn --private-users=false --timezone=copy --resolv-conf=copy-host --read-only --tmpfs=/var --tmpfs=/run --overlay="+/etc::/etc" -D /var/lib/machines/t1 ping -c 1 example.com
Spawning container t1 on /var/lib/machines/t1.
Press ^] three times within 1s to kill container.
ping: example.com: Temporary failure in name resolution
Container t1 failed with error code 130.
```
With the patch:
```
$ sudo ./build/systemd-nspawn --private-users=false --timezone=copy --resolv-conf=copy-host --read-only --tmpfs=/var --tmpfs=/run --overlay="+/etc::/etc" -D /var/lib/machines/t1 ping -qc 1 example.com
Spawning container t1 on /var/lib/machines/t1.
Press ^] three times within 1s to kill container.
PING example.com (93.184.216.34) 56(84) bytes of data.
--- example.org ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 110.912/110.912/110.912/0.000 ms
Container t1 exited successfully.
```
This changes the calendarspec parser to allow expressions such as
"00:05..05", i.e. a range where start and end is the same. It also
allows expressions such as "00:1-2/3", i.e. where the repetition value
does not fit even once in the specified range. With this patch both
cases will now be optimized away, i.e. the range is removed and a fixed
value is used, which is functionally equivalent.
See #15030 for an issue where the inability to parse such expressions
caused confusion.
I think it's probably better to accept these gracefully and optimizing
them away instead of refusing them with a plain EINVAL. With a tool such
as "systemd-analyze" calendar it should be easy to figure out the
normalized form with the redundant bits optimized away.
Let's try to mangle table contents a bit to make them more suitable as
JSON field names. Specifically when we see "foo bar" convert this to
"foo_bar" as field name, as variable/field names are generally assumed
to be without spaces.
HUGE_SIZE was defined inconsistently.
> In file included from ../src/basic/alloc-util.h:9,
> from ../src/journal/test-compress.c:9:
> ../src/journal/test-compress.c: In function ‘main’:
> ../src/journal/test-compress.c:280:33: error: ‘HUGE_SIZE’ undeclared (first use in this function)
> 280 | assert_se(huge = malloc(HUGE_SIZE));
This adds the sd_notify_barrier function, to allow users to synchronize against
the reception of sd_notify(3) status messages. It acts as a synchronization
point, and a successful return gurantees that all previous messages have been
consumed by the manager. This can be used to eliminate race conditions where
the sending process exits too early for systemd to associate its PID to a
cgroup and attribute the status message to a unit correctly.
systemd-notify now uses this function for proper notification delivery and be
useful for NotifyAccess=all units again in user mode, or in cases where it
doesn't have a control process as parent.
Fixes: #2739
A service can specify FDSTORE=1 FDPOLL=0 to request that PID1 does not
poll the fd to remove them on error. If set, fds will only be removed on
FDSTOREREMOVE=1 or when the service is done.
Fixes: #12086
With cgroup v2 the cgroup freezer is implemented as a cgroup
attribute called cgroup.freeze. cgroup can be frozen by writing "1"
to the file and kernel will send us a notification through
"cgroup.events" after the operation is finished and processes in the
cgroup entered quiescent state, i.e. they are not scheduled to
run. Writing "0" to the attribute file does the inverse and process
execution is resumed.
This commit exposes above low-level functionality through systemd's DBus
API. Each unit type must provide specialized implementation for these
methods, otherwise, we return an error. So far only service, scope, and
slice unit types provide the support. It is possible to check if a
given unit has the support using CanFreeze() DBus property.
Note that DBus API has a synchronous behavior and we dispatch the reply
to freeze/thaw requests only after the kernel has notified us that
requested operation was completed.
$ systemctl --no-pager --root /tmp/root2/ cat ctrl-alt-del.target
Failed to resolve symlink /tmp/root2/etc/systemd/system/ctrl-alt-del.target pointing to /usr/lib/systemd/system/reboot.target, ignoring: Channel number out of range
...
EFI variable access is nowadays subject to rate limiting by the kernel.
Thus, let's cache the results of checking them, in order to minimize how
often we access them.
Fixes: #14828
Callers of cg_get_keyed_attribute_full() can now specify via the flag whether the
missing keyes in cgroup attribute file are OK or not. Also the wrappers for both
strict and graceful version are provided.
This has the advantage that mac_selinux_access_check() can be used as a
function in all contexts. For example, parameters passed to it won't be
reported as unused if the "function" call is replaced with 0 on SELinux
disabled builds.
if we parse an xattr line that has no valid assignment, we might end up
with an empty ->xattr list. Don't hit assert on that, just go on.
Fixes: #15610
Let's allow more memory to be locked on beefy machines than on small
ones. The previous limit of 64M is the lower bound still. This
effectively means on a 4GB machine we can lock 512M, which should be
more than enough, but still not lock up the machine entirely under
pressure.
Fixes: #15053
WSL2 will soon (TM) include the "WSL2" string in /proc/sys/kernel/osrelease
so the workaround will no longer be necessary.
We have several different cloud images which do include the "microsoft"
string already, which would break this detection. They are for internal
usage at the moment, but the userspace side can come from all over the
place so it would be quite hard to track and downstream-patch to avoid
breakages.
This reverts commit a2f838d590.
On my laptop (Lenovo X1carbo 4th) I very occasionally see test-boot-timestamps
fail with this tb:
262/494 test-boot-timestamps FAIL 0.7348453998565674 s (killed by signal 6 SIGABRT)
08:12:48 SYSTEMD_LANGUAGE_FALLBACK_MAP='/home/zbyszek/src/systemd/src/locale/language-fallback-map' SYSTEMD_KBD_MODEL_MAP='/home/zbyszek/src/systemd/src/locale/kbd-model-map' PATH='/home/zbyszek/src/systemd/build:/home/zbyszek/.local/bin:/usr/lib64/qt-3.3/bin:/usr/share/Modules/bin:/usr/condabin:/usr/lib64/ccache:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/home/zbyszek/bin:/var/lib/snapd/snap/bin' /home/zbyszek/src/systemd/build/test-boot-timestamps
--- stderr ---
Failed to read $container of PID 1, ignoring: Permission denied
Found container virtualization none.
Failed to get SystemdOptions EFI variable, ignoring: Interrupted system call
Failed to read ACPI FPDT: Permission denied
Failed to read LoaderTimeInitUSec: Interrupted system call
Failed to read EFI loader data: Interrupted system call
Assertion 'q >= 0' failed at src/test/test-boot-timestamps.c:84, function main(). Aborting.
Normally it takes ~0.02s, but here there's a slowdown to 0.73 and things fail with EINTR.
This happens only occasionally, and I haven't been able to capture a strace.
It would be to ignore that case in test-boot-timestamps or always translate
EINTR to -ENODATA. Nevertheless, I think it's better to retry, since this gives
as more resilient behaviour and avoids a transient failure.
See
https://github.com/torvalds/linux/blob/master/fs/efivarfs/file.c#L75
and
bef3efbeb8.
Don't assume that 4MB can be allocated from stack since there could be smaller
DefaultLimitSTACK= in force, so let's use malloc(). NUL terminate the huge
strings by hand, also ensure termination in test_lz4_decompress_partial() and
optimize the memset() for the string.
Some items in /proc and /etc may not be accessible to poor unprivileged users
due to e.g. SELinux, BOFH or both, so check for EACCES and EPERM.
/var/tmp may be a symlink to /tmp and then path_compare() will always fail, so
let's stick to /tmp like elsewhere.
/tmp may be mounted with noexec option and then trying to execute scripts from
there would fail.
Detect and warn if seccomp is already in use, which could make seccomp test
fail if the syscalls are already blocked.
Unset $TMPDIR so it will not break specifier tests where %T is assumed to be
/tmp and %V /var/tmp.
It's not always mounted, e.g. during the build-time tests, it's running inside
a chroot (that's how Debian/Ubuntu build packages, in chroots) so this test
always fails because /sys/fs/cgroup isn't mounted.
When nothing at all is mounted at /sys/fs/cgroup, the fs.f_type is
SYSFS_MAGIC (0x62656572) which results in the confusing debug log:
"Unknown filesystem type 62656572 mounted on /sys/fs/cgroup."
Instead, if the f_type is SYSFS_MAGIC, a more accurate message is:
"No filesystem is currently mounted on /sys/fs/cgroup."
Split out of #15457, let's see if this is the culprit of the CI failure.
(also setting green label here, since @keszybz already greenlit it in that other PR)
We unregister binfmt_misc twice during shutdown with this change:
1. A previous commit added support for doing that in the final shutdown
phase, i.e. when we do the aggressive umount loop. This is the robust
thing to do, in case the earlier ("clean") shutdown phase didn't work
for some reason.
2. This commit adds support for doing that when systemd-binfmt.service
is stopped. This is a good idea so that people can order mounts
before the service if they want to register binaries from such
mounts, as in that case we'll undo the registration on shutdown
again, before unmounting those mounts.
And all that, just because of that weird "F" flag the kernel introduced
that can pin files...
Fixes: #14981
Let's just copy out the bit of the string we need, and let's make sure
we refuse rules called "status" and "register", since those are special
files in binfmt_misc's file system.
Apparently if the new "F" flag is used they might pin files, which
blocks us from unmounting things. Let's hence clear this up explicitly.
Before entering our umount loop.
Fixes: #14981
let's return ENOSYS in that case, to make things a bit less confusng.
Previously we'd just propagate ENOENT, which people might mistake as
applying to the object being modified rather than /proc/ just not being
there.
Let's return ENOSYS instead, i.e. an error clearly indicating that some
kernel API is not available. This hopefully should put people on a
better track.
Note that we only do the procfs check in the error path, which hopefully
means it's the less likely path.
We probably can add similar bits to more suitable codepaths dealing with
/proc/self/fd, but for now, let's pick to the ones noticed in #14745.
Fixes: #14745