This way we can corectly ensure that when a unit that requires some
controller goes away, we propagate the removal of it all the way up, so
that the controller is turned off in all the parents too.
Previously we tried to be smart: when a new unit appeared and it only
added controllers to the cgroup mask we'd update the cached members mask
in all parents by ORing in the controller flags in their cached values.
Unfortunately this was quite broken, as we missed some conditions when
this cache had to be reset (for example, when a unit got unloaded),
moreover the optimization doesn't work when a controller is removed
anyway (as in that case there's no other way for the parent to iterate
though all children if any other, remaining child unit still needs it).
Hence, let's simplify the logic substantially: instead of updating the
cache on the right events (which we didn't get right), let's simply
invalidate the cache, and generate it lazily when we encounter it later.
This should actually result in better behaviour as we don't have to
calculate the new members mask for a whole subtree whever we have the
suspicion something changed, but can delay it to the point where we
actually need the members mask.
This allows us to simplify things quite a bit, which is good, since
validating this cache for correctness is hard enough.
Fixes: #9512
After creating a cgroup we need to initialize its
"cgroup.subtree_control" file with the controllers its children want to
use. Currently we do so whenever the mkdir() on the cgroup succeeded,
i.e. when we know the cgroup is "fresh". Let's update the condition
slightly that we also do so when internally we assume a cgroup doesn't
exist yet, even if it already does (maybe left-over from a previous
run).
This shouldn't change anything IRL but make things a bit more robust.
Previously this would manipulate the realization mask for invalidating
the realization. This is a bit ugly though as the realization mask's
primary purpose to is to reflect in which hierarchies a cgroup currently
exists, and it's probably a good idea to keep that in sync with
realities.
We nowadays have the an explicit fields for invalidating cgroup
controller information, the "cgroup_invalidated_mask", let's use this
one instead.
The effect is pretty much the same, as the main consumer of these masks
(unit_has_mask_realize()) checks both anyway.
This changes cg_enable_everywhere() to return which controllers are
enabled for the specified cgroup. This information is then used to
correctly track the enablement mask currently in effect for a unit.
Moreover, when we try to turn off a controller, and this works, then
this is indicates that the parent unit might succesfully turn it off
now, too as our unit might have kept it busy.
So far, when realizing cgroups, i.e. when syncing up the kernel
representation of relevant cgroups with our own idea we would strictly
work from the root to the leaves. This is generally a good approach, as
when controllers are enabled this has to happen in root-to-leaves order.
However, when controllers are disabled this has to happen in the
opposite order: in leaves-to-root order (this is because controllers can
only be enabled in a child if it is already enabled in the parent, and
if it shall be disabled in the parent then it has to be disabled in the
child first, otherwise it is considered busy when it is attempted to
remove it in the parent).
To make things complicated when invalidating a unit's cgroup membershup
systemd can actually turn off some controllers previously turned on at
the very same time as it turns on other controllers previously turned
off. In such a case we have to work up leaves-to-root *and*
root-to-leaves right after each other. With this patch this is
implemented: we still generally operate root-to-leaves, but as soon as
we noticed we successfully turned off a controller previously turned on
for a cgroup we'll re-enqueue the cgroup realization for all parents of
a unit, thus implementing leaves-to-root where necessary.
Let's tweak when precisely to apply cgroup attributes on the root
cgroup.
With this we now follow the following rules:
1. On cgroupsv2 we never apply any regular cgroups to the host root,
since the attributes generally do not exist there.
2. On cgroupsv1 we do not apply any "weight" or "shares" style
attributes to the host root cgroup, since they don't make much sense
on the top level where there's only one group, hence no need to
compare weights against each other. The other attributes are applied
to the host root cgroup however.
3. In any case we don't apply attributes to the root of container
environments (and --user roots), under the assumption that this is
managed by the manager further up. (Note that on cgroupsv2 this is
even enforced by the kernel)
4. BPF pseudo-attributes are applied in all cases (since we can have as
many of them as we want)
Let's emphasize that this function checks for the host root cgroup, i.e.
returns false for the root cgroup when we run in a container where
CLONE_NEWCGROUP is used. There has been some confusion around this
already, for example cgroup_context_apply() uses the function
incorrectly (which we'll fix in a later commit).
Just some refactoring, not change in behaviour.
If we run in a container we shouldn't patch around this, and most likely
we can't anyway, and there's not much point in complaining about this.
Hence let's strictly say: the agent is private property of the host's
system instance, nothing else.
Ideally, coccinelle would strip unnecessary braces too. But I do not see any
option in coccinelle for this, so instead, I edited the patch text using
search&replace to remove the braces. Unfortunately this is not fully automatic,
in particular it didn't deal well with if-else-if-else blocks and ifdefs, so
there is an increased likelikehood be some bugs in such spots.
I also removed part of the patch that coccinelle generated for udev, where we
returns -1 for failure. This should be fixed independently.
We keep a mark whether a single-shot timer was triggered in the caller's
variable initial. When such a timer elapses while we are
serializing/deserializing the inner state, we consider the timer
incorrectly as elapsed and don't trigger it later.
This patch exploits last_trigger timestamp that we already serialize,
hence we can eliminate the argument initial completely.
A reproducer for OnBootSec= timers:
cat >repro.c <<EOD
/*
* Compile: gcc repro.c -o repro
* Run: ./repro
*/
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>
int main(int argc, char *argv[]) {
char command[1024];
int pause;
struct timespec now;
while (1) {
usleep(rand() % 200000); // prevent periodic repeats
clock_gettime(CLOCK_MONOTONIC, &now);
printf("%i\n", now.tv_sec);
system("rm -f $PWD/mark");
snprintf(command, 1024, "systemd-run --user --on-boot=%i --timer-property=AccuracySec=100ms "
"touch $PWD/mark", now.tv_sec + 1);
system(command);
system("systemctl --user list-timers");
pause = (1000000000 - now.tv_nsec)/1000 - 70000; // fiddle to hit the middle of reloading
usleep(pause > 0 ? pause : 0);
system("systemctl --user daemon-reload");
sync();
sleep(2);
if (open("./mark", 0) < 0)
if (errno == ENOENT) {
printf("mark file does not exist\n");
break;
}
}
return 0;
}
EOD
For PID 1 we adjust the umask to 0, but generators should not run that
way, given that they might be implemented as shell scripts and such.
Let's hence explicitly adjust the umask for them.
We already do this for unit generators. Let's do this for env
generators, too.
Otherwise we keep collecting stuff from env generators, and we really
shouldn't.
This was working properly on reexec but not on reload, as for reexec we
would always start fresh, but for reload would reuse the Manager object
and hence its default environment set.
Fixes: #10671
We now don't enable the CPU controller just for CPU accounting if we are
on 4.15+ and using pure unified hierarchy, as this is provided
externally to the CPU controller. This makes CPUAccounting=yes
essentially free, so enabling it by default when it's cheap seems like a
good idea.
systemd only uses functions that are as of Linux 4.15+ provided
externally to the CPU controller (currently usage_usec), so if we have a
new enough kernel, we don't need to set CGROUP_MASK_CPU for
CPUAccounting=true as the CPU controller does not need to necessarily be
enabled in this case.
Part of this patch is modelled on an earlier patch by Ryutaroh Matsumoto
(see PR #9665).