cgroup: s/cgroups? ?v?([0-9])/cgroup v\1/gI
Nitpicky, but we've used a lot of random spacings and names in the past, and we're now trying to be completely consistent on "cgroup vN". Generated by `fd -0 | xargs -0 -n1 sed -ri --follow-symlinks 's/cgroups? ?v?([0-9])/cgroup v\1/gI'`. I manually skipped places where the replacement is not appropriate (e.g. the "cgroup2" fstype and in src/shared/linux).
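To see what the expression in the commit message does to the various historical spellings, here is a minimal sketch in Python. It assumes Python's `re` module behaves like `sed -r` with the `I` flag for this particular pattern; the `normalize` helper and the sample strings are hypothetical and not part of the commit.

```python
import re

# Same pattern as the sed expression 's/cgroups? ?v?([0-9])/cgroup v\1/gI':
# optional plural "s", optional space, optional "v", then one digit.
PATTERN = re.compile(r"cgroups? ?v?([0-9])", re.IGNORECASE)


def normalize(text: str) -> str:
    """Rewrite every cgroup version spelling in text to 'cgroup vN'."""
    return PATTERN.sub(r"cgroup v\1", text)


if __name__ == "__main__":
    samples = [
        "controls the cgroupsv2 memory.min attribute",
        "again in pure cgroups v2 environments",
        "All real cgroupv1 controllers",
        "Check if we have cgroups2 already mounted.",
        "mount -t cgroup2 none /sys/fs/cgroup",
    ]
    for sample in samples:
        print(f"{sample!r} -> {normalize(sample)!r}")
```

The last sample shows why manual review was still needed: the pattern also matches the literal "cgroup2" filesystem type name, which must stay unchanged.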
parent 788291d3b4
commit 4e1dfa45e9
14
NEWS
@@ -133,13 +133,13 @@ CHANGES WITH 240:

 * The new "MemoryMin=" unit file property may now be used to set the
 memory usage protection limit of processes invoked by the unit. This
-controls the cgroupsv2 memory.min attribute. Similarly, the new
+controls the cgroup v2 memory.min attribute. Similarly, the new
 "IODeviceLatencyTargetSec=" property has been added, wrapping the new
-cgroupsv2 io.latency cgroup property for configuring per-service I/O
+cgroup v2 io.latency cgroup property for configuring per-service I/O
 latency.

-* systemd now supports the cgroupsv2 devices BPF logic, as counterpart
-to the cgroupsv1 "devices" cgroup controller.
+* systemd now supports the cgroup v2 devices BPF logic, as counterpart
+to the cgroup v1 "devices" cgroup controller.

 * systemd-escape now is able to combine --unescape with --template. It
 also learnt a new option --instance for extracting and unescaping the
@@ -355,7 +355,7 @@ CHANGES WITH 240:

 * The JoinControllers= option in system.conf is no longer supported, as
 it didn't work correctly, is hard to support properly, is legacy (as
-the concept only exists on cgroupsv1) and apparently wasn't used.
+the concept only exists on cgroup v1) and apparently wasn't used.

 * Journal messages that are generated whenever a unit enters the failed
 state are now tagged with a unique MESSAGE_ID. Similarly, messages
@@ -992,7 +992,7 @@ CHANGES WITH 238:
 instance to migrate processes if it itself gets the request to
 migrate processes and the kernel refuses this due to access
 restrictions. Thanks to this "systemd-run --scope --user …" works
-again in pure cgroups v2 environments when invoked from the user
+again in pure cgroup v2 environments when invoked from the user
 session scope.

 * A new TemporaryFileSystem= setting can be used to mask out part of
@@ -2708,7 +2708,7 @@ CHANGES WITH 231:
 desired options.

 * systemd now supports the "memory" cgroup controller also on
-cgroupsv2.
+cgroup v2.

 * The systemd-cgtop tool now optionally takes a control group path as
 command line argument. If specified, the control group list shown is
2
TODO
@@ -58,7 +58,7 @@ Features:
 * when a socket unit is spawned with an AF_UNIX path in /var/run, complain and
 patch it to use /run instead

-* set memory.oom.group in cgroupsv2 for all leaf cgroups (kernel v4.19+)
+* set memory.oom.group in cgroup v2 for all leaf cgroups (kernel v4.19+)

 * add a new syscall group "@esoteric" for more esoteric stuff such as bpf() and
 usefaultd() and make systemd-analyze check for it.
@@ -17,7 +17,7 @@ container managers.

 Before you read on, please make sure you read the low-level [kernel
 documentation about
-cgroupsv2](https://www.kernel.org/doc/Documentation/cgroup-v2.txt). This
+cgroup v2](https://www.kernel.org/doc/Documentation/cgroup-v2.txt). This
 documentation then adds in the higher-level view from systemd.

 This document augments the existing documentation we already have:
@@ -34,8 +34,8 @@ wiki documentation into this very document, too.)
 ## Two Key Design Rules

 Much of the philosophy behind these concepts is based on a couple of basic
-design ideas of cgroupsv2 (which we however try to adapt as far as we can to
-cgroupsv1 too). Specifically two cgroupsv2 rules are the most relevant:
+design ideas of cgroup v2 (which we however try to adapt as far as we can to
+cgroup v1 too). Specifically two cgroup v2 rules are the most relevant:

 1. The **no-processes-in-inner-nodes** rule: this means that it's not permitted
 to have processes directly attached to a cgroup that also has child cgroups and
@@ -58,45 +58,45 @@ your container manager creates and manages cgroups in the system's root cgroup
 you violate rule #2, as the root cgroup is managed by systemd and hence off
 limits to everybody else.

-Note that rule #1 is generally enforced by the kernel if cgroupsv2 is used: as
+Note that rule #1 is generally enforced by the kernel if cgroup v2 is used: as
 soon as you add a process to a cgroup it is ensured the rule is not
-violated. On cgroupsv1 this rule didn't exist, and hence isn't enforced, even
+violated. On cgroup v1 this rule didn't exist, and hence isn't enforced, even
 though it's a good thing to follow it then too. Rule #2 is not enforced on
-either cgroupsv1 nor cgroupsv2 (this is UNIX after all, in the general case
+either cgroup v1 nor cgroup v2 (this is UNIX after all, in the general case
 root can do anything, modulo SELinux and friends), but if you ignore it you'll
 be in constant pain as various pieces of software will fight over cgroup
 ownership.

-Note that cgroupsv1 is currently the most deployed implementation, even though
+Note that cgroup v1 is currently the most deployed implementation, even though
 it's semantically broken in many ways, and in many cases doesn't actually do
-what people think it does. cgroupsv2 is where things are going, and most new
-kernel features in this area are only added to cgroupsv2, and not cgroupsv1
-anymore. For example cgroupsv2 provides proper cgroup-empty notifications, has
+what people think it does. cgroup v2 is where things are going, and most new
+kernel features in this area are only added to cgroup v2, and not cgroup v1
+anymore. For example cgroup v2 provides proper cgroup-empty notifications, has
 support for all kinds of per-cgroup BPF magic, supports secure delegation of
 cgroup trees to less privileged processes and so on, which all are not
-available on cgroupsv1.
+available on cgroup v1.

 ## Three Different Tree Setups 🌳

 systemd supports three different modes how cgroups are set up. Specifically:

-1. **Unified** — this is the simplest mode, and exposes a pure cgroupsv2
+1. **Unified** — this is the simplest mode, and exposes a pure cgroup v2
 logic. In this mode `/sys/fs/cgroup` is the only mounted cgroup API file system
 and all available controllers are exclusively exposed through it.

-2. **Legacy** — this is the traditional cgroupsv1 mode. In this mode the
+2. **Legacy** — this is the traditional cgroup v1 mode. In this mode the
 various controllers each get their own cgroup file system mounted to
 `/sys/fs/cgroup/<controller>/`. On top of that systemd manages its own cgroup
 hierarchy for managing purposes as `/sys/fs/cgroup/systemd/`.

 3. **Hybrid** — this is a hybrid between the unified and legacy mode. It's set
 up mostly like legacy, except that there's also an additional hierarchy
-`/sys/fs/cgroup/unified/` that contains the cgroupsv2 hierarchy. (Note that in
+`/sys/fs/cgroup/unified/` that contains the cgroup v2 hierarchy. (Note that in
 this mode the unified hierarchy won't have controllers attached, the
 controllers are all mounted as separate hierarchies as in legacy mode,
-i.e. `/sys/fs/cgroup/unified/` is purely and exclusively about core cgroupsv2
+i.e. `/sys/fs/cgroup/unified/` is purely and exclusively about core cgroup v2
 functionality and not about resource management.) In this mode compatibility
-with cgroupsv1 is retained while some cgroupsv2 features are available
+with cgroup v1 is retained while some cgroup v2 features are available
 too. This mode is a stopgap. Don't bother with this too much unless you have
 too much free time.

@@ -116,7 +116,7 @@ to talk of one specific cgroup and actually mean the same cgroup in all
 available controller hierarchies. E.g. if we talk about the cgroup `/foo/bar/`
 then we actually mean `/sys/fs/cgroup/cpu/foo/bar/` as well as
 `/sys/fs/cgroup/memory/foo/bar/`, `/sys/fs/cgroup/pids/foo/bar/`, and so on.
-Note that in cgroupsv2 the controller hierarchies aren't orthogonal, hence
+Note that in cgroup v2 the controller hierarchies aren't orthogonal, hence
 thinking about them as orthogonal won't help you in the long run anyway.

 If you wonder how to detect which of these three modes is currently used, use
@@ -168,7 +168,7 @@ cgroup `/foo.slice/foo-bar.slice/foo-bar-baz.slice/quux.service/`.
 By default systemd sets up four slice units:

 1. `-.slice` is the root slice. i.e. the parent of everything else. On the host
-system it maps directly to the top-level directory of cgroupsv2.
+system it maps directly to the top-level directory of cgroup v2.

 2. `system.slice` is where system services are by default placed, unless
 configured otherwise.
@@ -187,8 +187,8 @@ above are just the defaults.

 Container managers and suchlike often want to control cgroups directly using
 the raw kernel APIs. That's entirely fine and supported, as long as proper
-*delegation* is followed. Delegation is a concept we inherited from cgroupsv2,
-but we expose it on cgroupsv1 too. Delegation means that some parts of the
+*delegation* is followed. Delegation is a concept we inherited from cgroup v2,
+but we expose it on cgroup v1 too. Delegation means that some parts of the
 cgroup tree may be managed by different managers than others. As long as it is
 clear which manager manages which part of the tree each one can do within its
 sub-graph of the tree whatever it wants.
@@ -217,7 +217,7 @@ guarantees:
 hierarchy (in unified and hybrid mode) as well as on systemd's own private
 hierarchy (in legacy and hybrid mode). It won't pass ownership of the legacy
 controller hierarchies. Delegation to less privileges processes is not safe
-in cgroupsv1 (as a limitation of the kernel), hence systemd won't facilitate
+in cgroup v1 (as a limitation of the kernel), hence systemd won't facilitate
 access to it.

 3. Any BPF IP filter programs systemd installs will be installed with
@@ -322,19 +322,19 @@ to work on that, and widen your horizon a bit. You are welcome.
 systemd supports a number of controllers (but not all). Specifically, supported
 are:

-* on cgroupsv1: `cpu`, `cpuacct`, `blkio`, `memory`, `devices`, `pids`
-* on cgroupsv2: `cpu`, `io`, `memory`, `pids`
+* on cgroup v1: `cpu`, `cpuacct`, `blkio`, `memory`, `devices`, `pids`
+* on cgroup v2: `cpu`, `io`, `memory`, `pids`

-It is our intention to natively support all cgroupsv2 controllers as they are
-added to the kernel. However, regarding cgroupsv1: at this point we will not
+It is our intention to natively support all cgroup v2 controllers as they are
+added to the kernel. However, regarding cgroup v1: at this point we will not
 add support for any other controllers anymore. This means systemd currently
-does not and will never manage the following controllers on cgroupsv1:
+does not and will never manage the following controllers on cgroup v1:
 `freezer`, `cpuset`, `net_cls`, `perf_event`, `net_prio`, `hugetlb`. Why not?
 Depending on the case, either their API semantics or implementations aren't
-really usable, or it's very clear they have no future on cgroupsv2, and we
+really usable, or it's very clear they have no future on cgroup v2, and we
 won't add new code for stuff that clearly has no future.

-Effectively this means that all those mentioned cgroupsv1 controllers are up
+Effectively this means that all those mentioned cgroup v1 controllers are up
 for grabs: systemd won't manage them, and hence won't delegate them to your
 code (however, systemd will still mount their hierarchies, simply because it
 mounts all controller hierarchies it finds available in the kernel). If you
@@ -355,9 +355,9 @@ cgroups in them — from previous runs, and be extra careful with them as they
 might still carry settings that might not be valid anymore.

 Note a particular asymmetry here: if your systemd version doesn't support a
-specific controller on cgroupsv1 you can still make use of it for delegation,
+specific controller on cgroup v1 you can still make use of it for delegation,
 by directly fiddling with its hierarchy and replicating the cgroup tree there
-as necessary (as suggested above). However, on cgroupsv2 this is different:
+as necessary (as suggested above). However, on cgroup v2 this is different:
 separately mounted hierarchies are not available, and delegation has always to
 happen through systemd itself. This means: when you update your kernel and it
 adds a new, so far unseen controller, and you want to use it for delegation,
@@ -417,7 +417,7 @@ unified you (of course, I guess) need to provide only `/sys/fs/cgroup/` itself.
 arbitrary naming, you might need to escape some of the names (for example,
 you really don't want to create a cgroup named `tasks`, just because the
 user created a container by that name, because `tasks` after all is a magic
-attribute in cgroupsv1, and your `mkdir()` will hence fail with `EEXIST`. In
+attribute in cgroup v1, and your `mkdir()` will hence fail with `EEXIST`. In
 systemd we do escaping by prefixing names that might collide with a kernel
 attribute name with an underscore. You might want to do the same, but this
 is really up to you how you do it. Just do it, and be careful.
@@ -462,9 +462,9 @@ unified you (of course, I guess) need to provide only `/sys/fs/cgroup/` itself.
 to get the cgroup for a unit. The method `GetUnitByControlGroup()` may be
 used to get the unit for a cgroup.)

-6. ⚡ Think twice before delegating cgroupsv1 controllers to less privileged
+6. ⚡ Think twice before delegating cgroup v1 controllers to less privileged
 containers. It's not safe, you basically allow your containers to freeze the
-system with that and worse. Delegation is a strongpoint of cgroupsv2 though,
+system with that and worse. Delegation is a strongpoint of cgroup v2 though,
 and there it's safe to treat delegation boundaries as privilege boundaries.

 And that's it for now. If you have further questions, refer to the systemd
@@ -872,7 +872,7 @@ int cg_set_access(
 bool fatal;
 };

-/* cgroupsv1, aka legacy/non-unified */
+/* cgroup v1, aka legacy/non-unified */
 static const struct Attribute legacy_attributes[] = {
 { "cgroup.procs", true },
 { "tasks", false },
@@ -880,7 +880,7 @@ int cg_set_access(
 {},
 };

-/* cgroupsv2, aka unified */
+/* cgroup v2, aka unified */
 static const struct Attribute unified_attributes[] = {
 { "cgroup.procs", true },
 { "cgroup.subtree_control", true },
@@ -2039,7 +2039,7 @@ int cg_get_keyed_attribute(
 char **v;
 int r;

-/* Reads one or more fields of a cgroupsv2 keyed attribute file. The 'keys' parameter should be an strv with
+/* Reads one or more fields of a cgroup v2 keyed attribute file. The 'keys' parameter should be an strv with
 * all keys to retrieve. The 'ret_values' parameter should be passed as string size with the same number of
 * entries as 'keys'. On success each entry will be set to the value of the matching key.
 *
@@ -2491,7 +2491,7 @@ int cg_kernel_controllers(Set **ret) {

 static thread_local CGroupUnified unified_cache = CGROUP_UNIFIED_UNKNOWN;

-/* The hybrid mode was initially implemented in v232 and simply mounted cgroup v2 on /sys/fs/cgroup/systemd. This
+/* The hybrid mode was initially implemented in v232 and simply mounted cgroup2 on /sys/fs/cgroup/systemd. This
 * unfortunately broke other tools (such as docker) which expected the v1 "name=systemd" hierarchy on
 * /sys/fs/cgroup/systemd. From v233 and on, the hybrid mode mountnbs v2 on /sys/fs/cgroup/unified and maintains
 * "name=systemd" hierarchy on /sys/fs/cgroup/systemd for compatibility with other tools.
@@ -2739,13 +2739,13 @@ bool cg_is_legacy_wanted(void) {
 if (wanted >= 0)
 return wanted;

-/* Check if we have cgroups2 already mounted. */
+/* Check if we have cgroup v2 already mounted. */
 if (cg_unified_flush() >= 0 &&
 unified_cache == CGROUP_UNIFIED_ALL)
 return (wanted = false);

 /* Otherwise, assume that at least partial legacy is wanted,
-* since cgroups2 should already be mounted at this point. */
+* since cgroup v2 should already be mounted at this point. */
 return (wanted = true);
 }

@@ -48,13 +48,13 @@ typedef enum CGroupMask {
 CGROUP_MASK_BPF_FIREWALL = CGROUP_CONTROLLER_TO_MASK(CGROUP_CONTROLLER_BPF_FIREWALL),
 CGROUP_MASK_BPF_DEVICES = CGROUP_CONTROLLER_TO_MASK(CGROUP_CONTROLLER_BPF_DEVICES),

-/* All real cgroupv1 controllers */
+/* All real cgroup v1 controllers */
 CGROUP_MASK_V1 = CGROUP_MASK_CPU|CGROUP_MASK_CPUACCT|CGROUP_MASK_BLKIO|CGROUP_MASK_MEMORY|CGROUP_MASK_DEVICES|CGROUP_MASK_PIDS,

-/* All real cgroupv2 controllers */
+/* All real cgroup v2 controllers */
 CGROUP_MASK_V2 = CGROUP_MASK_CPU|CGROUP_MASK_IO|CGROUP_MASK_MEMORY|CGROUP_MASK_PIDS,

-/* All cgroupv2 BPF pseudo-controllers */
+/* All cgroup v2 BPF pseudo-controllers */
 CGROUP_MASK_BPF = CGROUP_MASK_BPF_FIREWALL|CGROUP_MASK_BPF_DEVICES,

 _CGROUP_MASK_ALL = CGROUP_CONTROLLER_TO_MASK(_CGROUP_CONTROLLER_MAX) - 1
@@ -104,7 +104,7 @@ static const char *maybe_format_bytes(char *buf, size_t l, bool is_valid, uint64

 static bool is_root_cgroup(const char *path) {

-/* Returns true if the specified path belongs to the root cgroup. The root cgroup is special on cgroupsv2 as it
+/* Returns true if the specified path belongs to the root cgroup. The root cgroup is special on cgroup v2 as it
 * carries only very few attributes in order not to export multiple truth about system state as most
 * information is available elsewhere in /proc anyway. We need to be able to deal with that, and need to get
 * our data from different sources in that case.
@@ -881,7 +881,7 @@ static void cgroup_context_apply(
 /* In fully unified mode these attributes don't exist on the host cgroup root. On legacy the weights exist, but
 * setting the weight makes very little sense on the host root cgroup, as there are no other cgroups at this
 * level. The quota exists there too, but any attempt to write to it is refused with EINVAL. Inside of
-* containers we want to leave control of these to the container manager (and if cgroupsv2 delegation is used
+* containers we want to leave control of these to the container manager (and if cgroup v2 delegation is used
 * we couldn't even write to them if we wanted to). */
 if ((apply_mask & CGROUP_MASK_CPU) && !is_local_root) {

@@ -925,7 +925,7 @@ static void cgroup_context_apply(
 }
 }

-/* The 'io' controller attributes are not exported on the host's root cgroup (being a pure cgroupsv2
+/* The 'io' controller attributes are not exported on the host's root cgroup (being a pure cgroup v2
 * controller), and in case of containers we want to leave control of these attributes to the container manager
 * (and we couldn't access that stuff anyway, even if we tried if proper delegation is used). */
 if ((apply_mask & CGROUP_MASK_IO) && !is_local_root) {
@@ -1067,7 +1067,7 @@ static void cgroup_context_apply(

 /* In unified mode 'memory' attributes do not exist on the root cgroup. In legacy mode 'memory.limit_in_bytes'
 * exists on the root cgroup, but any writes to it are refused with EINVAL. And if we run in a container we
-* want to leave control to the container manager (and if proper cgroupsv2 delegation is used we couldn't even
+* want to leave control to the container manager (and if proper cgroup v2 delegation is used we couldn't even
 * write to this if we wanted to.) */
 if ((apply_mask & CGROUP_MASK_MEMORY) && !is_local_root) {

@@ -1109,7 +1109,7 @@ static void cgroup_context_apply(
 }
 }

-/* On cgroupsv2 we can apply BPF everywhere. On cgroupsv1 we apply it everywhere except for the root of
+/* On cgroup v2 we can apply BPF everywhere. On cgroup v1 we apply it everywhere except for the root of
 * containers, where we leave this to the manager */
 if ((apply_mask & (CGROUP_MASK_DEVICES | CGROUP_MASK_BPF_DEVICES)) &&
 (is_host_root || cg_all_unified() > 0 || !is_local_root)) {
@ -1841,14 +1841,14 @@ static bool unit_has_mask_realized(
|
||||||
/* Returns true if this unit is fully realized. We check four things:
|
/* Returns true if this unit is fully realized. We check four things:
|
||||||
*
|
*
|
||||||
* 1. Whether the cgroup was created at all
|
* 1. Whether the cgroup was created at all
|
||||||
* 2. Whether the cgroup was created in all the hierarchies we need it to be created in (in case of cgroupsv1)
|
* 2. Whether the cgroup was created in all the hierarchies we need it to be created in (in case of cgroup v1)
|
||||||
* 3. Whether the cgroup has all the right controllers enabled (in case of cgroupsv2)
|
* 3. Whether the cgroup has all the right controllers enabled (in case of cgroup v2)
|
||||||
* 4. Whether the invalidation mask is currently zero
|
* 4. Whether the invalidation mask is currently zero
|
||||||
*
|
*
|
||||||
* If you wonder why we mask the target realization and enable mask with CGROUP_MASK_V1/CGROUP_MASK_V2: note
|
* If you wonder why we mask the target realization and enable mask with CGROUP_MASK_V1/CGROUP_MASK_V2: note
|
||||||
* that there are three sets of bitmasks: CGROUP_MASK_V1 (for real cgroupv1 controllers), CGROUP_MASK_V2 (for
|
* that there are three sets of bitmasks: CGROUP_MASK_V1 (for real cgroup v1 controllers), CGROUP_MASK_V2 (for
|
||||||
* real cgroupv2 controllers) and CGROUP_MASK_BPF (for BPF-based pseudo-controllers). Now, cgroup_realized_mask
|
* real cgroup v2 controllers) and CGROUP_MASK_BPF (for BPF-based pseudo-controllers). Now, cgroup_realized_mask
|
||||||
* only matters for cgroupsv1 controllers, and cgroup_enabled_mask is only used for cgroupsv2, and if they
|
* only matters for cgroup v1 controllers, and cgroup_enabled_mask is only used for cgroup v2, and if they
|
||||||
* differ in the others, we don't really care. (After all, the cgroup_enabled_mask tracks which controllers are
|
* differ in the others, we don't really care. (After all, the cgroup_enabled_mask tracks which controllers are
|
||||||
* enabled through cgroup.subtree_control, and since the BPF pseudo-controllers don't show up there, they
|
* enabled through cgroup.subtree_control, and since the BPF pseudo-controllers don't show up there, they
|
||||||
* simply don't matter.) */
|
* simply don't matter.) */
|
||||||
|
|
|
@ -3137,9 +3137,9 @@ static int exec_child(
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
/* If delegation is enabled we'll pass ownership of the cgroup to the user of the new process. On cgroupsv1
|
/* If delegation is enabled we'll pass ownership of the cgroup to the user of the new process. On cgroup v1
|
||||||
* this is only about systemd's own hierarchy, i.e. not the controller hierarchies, simply because that's not
|
* this is only about systemd's own hierarchy, i.e. not the controller hierarchies, simply because that's not
|
||||||
* safe. On cgroupsv2 there's only one hierarchy anyway, and delegation is safe there, hence in that case only
|
* safe. On cgroup v2 there's only one hierarchy anyway, and delegation is safe there, hence in that case only
|
||||||
* touch a single hierarchy too. */
|
* touch a single hierarchy too. */
|
||||||
if (params->cgroup_path && context->user && (params->flags & EXEC_CGROUP_DELEGATE)) {
|
if (params->cgroup_path && context->user && (params->flags & EXEC_CGROUP_DELEGATE)) {
|
||||||
r = cg_set_access(SYSTEMD_CGROUP_CONTROLLER, params->cgroup_path, uid, gid);
|
r = cg_set_access(SYSTEMD_CGROUP_CONTROLLER, params->cgroup_path, uid, gid);
|
||||||
|
|
|
@ -248,8 +248,8 @@ typedef struct Unit {
|
||||||
|
|
||||||
/* Counterparts in the cgroup filesystem */
|
/* Counterparts in the cgroup filesystem */
|
||||||
char *cgroup_path;
|
char *cgroup_path;
|
||||||
CGroupMask cgroup_realized_mask; /* In which hierarchies does this unit's cgroup exist? (only relevant on cgroupsv1) */
|
CGroupMask cgroup_realized_mask; /* In which hierarchies does this unit's cgroup exist? (only relevant on cgroup v1) */
|
||||||
CGroupMask cgroup_enabled_mask; /* Which controllers are enabled (or more correctly: enabled for the children) for this unit's cgroup? (only relevant on cgroupsv2) */
|
CGroupMask cgroup_enabled_mask; /* Which controllers are enabled (or more correctly: enabled for the children) for this unit's cgroup? (only relevant on cgroup v2) */
|
||||||
CGroupMask cgroup_invalidated_mask; /* A mask specifying controllers which shall be considered invalidated, and require re-realization */
|
CGroupMask cgroup_invalidated_mask; /* A mask specifying controllers which shall be considered invalidated, and require re-realization */
|
||||||
CGroupMask cgroup_members_mask; /* A cache for the controllers required by all children of this cgroup (only relevant for slice units) */
|
CGroupMask cgroup_members_mask; /* A cache for the controllers required by all children of this cgroup (only relevant for slice units) */
|
||||||
int cgroup_inotify_wd;
|
int cgroup_inotify_wd;
|
||||||
|
|
|
@ -257,7 +257,7 @@ static int client_context_read_cgroup(Server *s, ClientContext *c, const char *u
|
||||||
|
|
||||||
/* We use the unit ID passed in as fallback if we have nothing cached yet and cg_pid_get_path_shifted()
|
/* We use the unit ID passed in as fallback if we have nothing cached yet and cg_pid_get_path_shifted()
|
||||||
* failed or process is running in a root cgroup. Zombie processes are automatically migrated to root cgroup
|
* failed or process is running in a root cgroup. Zombie processes are automatically migrated to root cgroup
|
||||||
* on cgroupsv1 and we want to be able to map log messages from them too. */
|
* on cgroup v1 and we want to be able to map log messages from them too. */
|
||||||
if (unit_id && !c->unit) {
|
if (unit_id && !c->unit) {
|
||||||
c->unit = strdup(unit_id);
|
c->unit = strdup(unit_id);
|
||||||
if (c->unit)
|
if (c->unit)
|
||||||
|
|
|
@ -33,7 +33,7 @@ if grep -q cgroup2 /proc/filesystems ; then
|
||||||
# And now check again, "io" should have vanished
|
# And now check again, "io" should have vanished
|
||||||
grep -qv io /sys/fs/cgroup/system.slice/cgroup.controllers
|
grep -qv io /sys/fs/cgroup/system.slice/cgroup.controllers
|
||||||
else
|
else
|
||||||
echo "Skipping TEST-19-DELEGATE, as the kernel doesn't actually support cgroupsv2" >&2
|
echo "Skipping TEST-19-DELEGATE, as the kernel doesn't actually support cgroup v2" >&2
|
||||||
fi
|
fi
|
||||||
|
|
||||||
echo OK > /testok
|
echo OK > /testok
|
||||||
|
|
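The comment on unit_has_mask_realized() in the hunk above describes four checks and explains why the realized mask is compared only within CGROUP_MASK_V1 and the enabled mask only within CGROUP_MASK_V2. The following is a minimal, self-contained sketch of that comparison, not the code from the commit: the `UnitSketch` type, the `sketch_has_mask_realized()` function and all `MASK_*` bit values are hypothetical names and assignments chosen only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t CGroupMask;

/* Hypothetical bit assignments, for illustration only. */
enum {
    MASK_CPU         = 1u << 0,
    MASK_MEMORY      = 1u << 1,
    MASK_DEVICES     = 1u << 2,  /* assumed to exist only as a cgroup v1 controller */
    MASK_BPF_DEVICES = 1u << 3,  /* assumed BPF-based pseudo-controller */

    MASK_V1  = MASK_CPU | MASK_MEMORY | MASK_DEVICES,
    MASK_V2  = MASK_CPU | MASK_MEMORY,
    MASK_BPF = MASK_BPF_DEVICES
};

/* Hypothetical stand-in for the cgroup-related fields of a unit. */
typedef struct {
    bool created;                 /* 1. was the cgroup created at all? */
    CGroupMask realized_mask;     /* 2. hierarchies the cgroup exists in (cgroup v1) */
    CGroupMask enabled_mask;      /* 3. controllers enabled for children (cgroup v2) */
    CGroupMask invalidated_mask;  /* 4. controllers pending re-realization */
} UnitSketch;

/* Returns true if all four checks pass. The XOR yields the bits in which the
 * current and wanted masks differ; ANDing with MASK_V1 or MASK_V2 discards
 * differences in bits that are irrelevant for that comparison, such as the
 * BPF pseudo-controllers. */
static bool sketch_has_mask_realized(const UnitSketch *u,
                                     CGroupMask target_mask,
                                     CGroupMask enable_mask) {
    return u->created &&
           ((u->realized_mask ^ target_mask) & MASK_V1) == 0 &&
           ((u->enabled_mask ^ enable_mask) & MASK_V2) == 0 &&
           u->invalidated_mask == 0;
}
```

With this masking, a unit whose wanted masks differ from the stored ones only in a BPF bit still counts as realized, which matches the reasoning in the comment that the BPF pseudo-controllers never show up in cgroup.subtree_control.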