The advantage is that is the name is mispellt, cpp will warn us.
$ git grep -Ee "conf.set\('(HAVE|ENABLE)_" -l|xargs sed -r -i "s/conf.set\('(HAVE|ENABLE)_/conf.set10('\1_/"
$ git grep -Ee '#ifn?def (HAVE|ENABLE)' -l|xargs sed -r -i 's/#ifdef (HAVE|ENABLE)/#if \1/; s/#ifndef (HAVE|ENABLE)/#if ! \1/;'
$ git grep -Ee 'if.*defined\(HAVE' -l|xargs sed -i -r 's/defined\((HAVE_[A-Z0-9_]*)\)/\1/g'
$ git grep -Ee 'if.*defined\(ENABLE' -l|xargs sed -i -r 's/defined\((ENABLE_[A-Z0-9_]*)\)/\1/g'
+ manual changes to meson.build
squash! build-sys: use #if Y instead of #ifdef Y everywhere
v2:
- fix incorrect setting of HAVE_LIBIDN2
Let's optimize dynamic UID allocation a bit: if a StateDirectory= (or
suchlike) is configured, we start our allocation loop from that UID and
use it if it currently isn't used otherwise. This is beneficial as it
saves us from having to expensively recursively chown() these
directories in the typical case (which StateDirectory= does when it
notices that the owner of the directory doesn't match the UID picked).
With this in place we now have the a three-phase logic for allocating a
dynamic UID:
a) first, we try to use the owning UID of StateDirectory=,
CacheDirectory=, LogDirectory= if that exists and is currently
otherwise unused.
b) if that didn't work out, we hash the UID from the service name
c) if that didn't yield an unused UID either, randomly pick new ones
until we find a free one.
Let's clean up the interaction of StateDirectory= (and friends) to
DynamicUser=1: instead of creating these directories directly below
/var/lib, place them in /var/lib/private instead if DynamicUser=1 is
set, making that directory 0700 and owned by root:root. This way, if a
dynamic UID is later reused, access to the old run's state directory is
prohibited for that user. Then, use file system namespacing inside the
service to make /var/lib/private a readable tmpfs, hiding all state
directories that are not listed in StateDirectory=, and making access to
the actual state directory possible. Mount all directories listed in
StateDirectory= to the same places inside the service (which means
they'll now be mounted into the tmpfs instance). Finally, add a symlink
from the state directory name in /var/lib/ to the one in
/var/lib/private, so that both the host and the service can access the
path under the same location.
Here's an example: let's say a service runs with StateDirectory=foo.
When DynamicUser=0 is set, it will get the following setup, and no
difference between what the unit and what the host sees:
/var/lib/foo (created as directory)
Now, if DynamicUser=1 is set, we'll instead get this on the host:
/var/lib/private (created as directory with mode 0700, root:root)
/var/lib/private/foo (created as directory)
/var/lib/foo → private/foo (created as symlink)
And from inside the unit:
/var/lib/private (a tmpfs mount with mode 0755, root:root)
/var/lib/private/foo (bind mounted from the host)
/var/lib/foo → private/foo (the same symlink as above)
This takes inspiration from how container trees are protected below
/var/lib/machines: they generally reuse UIDs/GIDs of the host, but
because /var/lib/machines itself is set to 0700 host users cannot access
files in the container tree even if the UIDs/GIDs are reused. However,
for this commit we add one further trick: inside and outside of the unit
/var/lib/private is a different thing: outside it is a plain,
inaccessible directory, and inside it is a world-readable tmpfs mount
with only the whitelisted subdirs below it, bind mounte din. This
means, from the outside the dir acts as an access barrier, but from the
inside it does not. And the symlink created in /var/lib/foo itself
points across the barrier in both cases, so that root and the unit's
user always have access to these dirs without knowing the details of
this mounting magic.
This logic resolves a major shortcoming of DynamicUser=1 units:
previously they couldn't safely store persistant data. With this change
they can have their own private state, log and data directories, which
they can write to, but which are protected from UID recycling.
With this change, if RootDirectory= or RootImage= are used it is ensured
that the specified state/log/cache directories are always mounted in
from the host. This change of semantics I think is much preferable since
this means the root directory/image logic can be used easily for
read-only resource bundling (as all writable data resides outside of the
image). Note that this is a change of behaviour, but given that we
haven't released any systemd version with StateDirectory= and friends
implemented this should be a safe change to make (in particular as
previously it wasn't clear what would actually happen when used in
combination). Moreover, by making this change we can later add a "+"
modifier to these setings too working similar to the same modifier in
ReadOnlyPaths= and friends, making specified paths relative to the
container itself.
When putting together the namespace, always create the file or directory
we are supposed to bind mount on, the same way we do it for most other
stuff, for example mount units or systemd-nspawn's --bind= option.
This has the big benefit that we can use namespace bind mounts on dirs
in /tmp or /var/tmp even in conjunction with PrivateTmp=.
Before this patch we had an ordering problem: if we have no namespacing
enabled except for two bind mounts that intend to swap /a and /b via
bind mounts, then we'd execute the bind mount binding /b to /a, followed
by thebind mount from /a to /b, thus having the effect that /b is now
visible in both /a and /b, which was not intended.
With this change, as soon as any bind mount is configured we'll put
together the service mount namespace in a temporary directory instead of
operating directly in the root. This solves the problem in a
straightforward fashion: the source of bind mounts will always refer to
the host, and thus be unaffected from the bind mounts we already
created.
We already create /dev implicitly if PrivateTmp=yes is on, if it is
missing. Do so too for the other two API VFS, as well as for /dev if
PrivateTmp=yes is off but MountAPIVFS=yes is on (i.e. when /dev is bind
mounted from the host).
In most cases we followed the rule that the special _INVALID and _MAX
values we use in our enums use the full type name as prefix (in contrast
to regular values that we often make shorter), do so for
ExecDirectoryType as well.
No functional changes, just a little bit of renaming to make this code
more like the rest.
This is particularly useful when used in conjunction with DynamicUser=1,
where the UID might change for every invocation, but is useful in other
cases too, for example, when these directories are shared between
systems where the UID assignments differ slightly.
No need to wait for a timeout when we know things are not going to work out.
When the main process goes away and only notifications from the main process are
accepted, then we will not receive any notifications anymore.
Currently, all three of cgroup_good(), main_pid_good(),
control_pid_good() all return an "int" (two of them propagate errors).
It's a good thing to keep the three functions similar, so let's leave it
at that, but then let's clean up the invocation of the three functions
so that they always clearly acknowledge that the return value is not a
bool, but potentially negative.
The compiler should be good enough to figure this out on its own if this
is a static function, and it makes control_pid_good() an outlier anyway,
and decorators like this tend to bitrot. Hence, to keep things simple
and automatic, let's just drop the decorator.
The processes associated with a service are not just the ones in its
cgroup, but also the control and main processes, which might possibly
live outside of it, for example if they transitioned into their own
cgroups because they registered a PAM session of their own. Hence, if we
get a cgroup empty notification always check if the main PID is still
around before taking action too eagerly.
Fixes: #6045
This slightly changes how we log about failures. Previously,
service_enter_dead() would log that a service unit failed along with its
result code, and unit_notify() would do this again but without the
result code. For other unit types only the latter would take effect.
This cleans this up: we keep the message in unit_notify() only for debug
purposes, and add type-specific log lines to all our unit types that can
fail, and always place them before unit_notify() is invoked.
Or in other words: the duplicate log message for service units is
removed, and all other unit types get a more useful line with the
precise result code.
This makes sure that if we learn via inotify or another event source
that a cgroup is empty, and we checked that this is indeed the case (as
we might get spurious notifications through inotify, as the inotify
logic through the "cgroups.event" is pretty unspecific and might be
trigger for a variety of reasons), then we'll enqueue a defer event for
it, at a priority lower than SIGCHLD handling, so that we know for sure
that if there's waitid() data for a process we used it before
considering the cgroup empty notification.
Fixes: #6608
We are about to add second cgroup-related queue, called
"cgroup_empty_queue", hence let's rename "cgroup_queue" to
"cgroup_realize_queue" (as that is its purpose) to minimize confusion
about the two queues.
Just a rename, no functional changes.
Normally, Symlinks= failing is not considered fatal nor destructive.
Let's slightly alter behaviour here if RemoveOnStop= is turned on. In
that case the use in a way opted for destructive behaviour and we do
unlink all sockets and symlinks when the socket unit goes down. And that
means we might as well unlink any pre-existing if this mode is selected.
Yeah, it's a bit of a stretch to do this, but @OhNoMoreGit is right: if
RemoveOnStop= is on we are destructive regarding any pre-existing
symlinks on stop, and it would be quite weird if we wouldn't be on
start.
Note that this change does not make symlink creation failing fatal. I am
not entirely sure about whether it should be, but I am leaning towards
not making it fatal for two reasons: symlinks like this tend to be a
compatibility feature, and hence unlikely to be essential for operation,
in a way this breaks compatibility, and while doing that is not off the
table, we should probably avoid it if we are not entirely sure it's a
good thing.
Note that this also changes plain symlink() to symlink_idempotent() so
that existing symlinks with the right destination are nothing we log
about.
Fixes: #6920
Just in case something opened them, let's make sure glibc invalidates
them too.
Thankfully so far no library opened log channels behind our back, at
least as far as I know, hence this is actually a NOP, but let's better
be safe than sorry.
Now that logging can implicitly reopen the log streams when needed we
can log errors without any special magic, hence let's normalize things,
and log the same way we do everywhere else.
Also drop the redundant states and make all similar changes too.
Thankfully the swap.c state engine is much simpler than mount.c's, hence
this should be easier to digest.
The function returns true for all states that have a control process
running, and each time we call it that's what we want to know, hence
let's rename it accordingly. Moreover, the more generic unit states have
an ACTIVE state, and it is defined quite differently from the set of
states this function returns true for, hence let's avoid confusion and
not reuse the word "ACTIVE" here in a different context.
Finally, let's uppercase this, since in most ways it's pretty much
identical to a macro
This changes the mount unit state engine in the following ways:
1. The MOUNT_MOUNTING_SIGTERM and MOUNT_MOUNTING_SIGKILL are removed.
They have been pretty much equivalent to MOUNT_UNMOUNTING_SIGTERM and
MOUNT_UNMOUNTING_SIGKILL in what they do, and the outcome has been
the same as well: the unit is stopped. Hence, let's simplify things a
bit, and merge them. Note that we keep
MOUNT_REMOUNTING_{SIGTERM|SIGKILL} however, as those states have a
different outcome: the unit remains started.
2. mount_enter_signal() will now honour the SendSIGKILL= option of the
mount unit if it was set. This was previously done already when we
entered the signal states through a timeout, and was simply missing
here.
3. A new helper function mount_enter_dead_or_mounted() is added that
places the mount unit in either MOUNT_DEAD or MOUNT_MOUNTED,
depending on what the kernel thinks about the mount's state. This
function is called at various places now, wherever we finished an
operation, and want to make sure our own state reflects again what
the kernel thinks. Previously we had very similar code in a number of
places and in other places didn't recheck the kernel state. Let's do
that with the same logic and function at all relevant places now.
4. Rework mount_stop(): never forget about running control processes.
Instead: when we have a start (i.e. a /bin/mount) process running,
and are asked to stop, then enter the kill states for it, so that it
gets cleaned up. This fixes#6048. Moreover, when we have a reload
process running convert the possible states into the relevant
unmounting states, so that we can properly execute the requested
operation.
Fixes#6048
Let's only collect the first failure in the load result, and let's clear
it explicitly when we are about to enter a new reload operation. This
makes it more alike the handling of the main result value (which also
only stores the first failure), and also the handling of service.c's
reload state.
Due to the chown() logic socket units might end up with processes even
if no explicit command is defined for them, hence let's make sure these
processes are in the right cgroup, and that means within a slice.
Mount, swap and service units unconditionally are assigned to a slice
already, let's do the same here, too.
(This becomes more important as soon as the ebpf/firewall stuff is
merged, as there'll be another reason to fork off processes then)