Commit Graph

265 Commits

Author SHA1 Message Date
Lennart Poettering d742f4b54b cgroup: correct mangling of return values
Let's not return the unmangled return value before we actually mangle
it.

Fixes: #11062
2018-12-10 16:09:41 +01:00
Lennart Poettering 92a993041a cgroup: call cg_all_unified() right before using the result
Let's not query it before we actually need it.
2018-12-10 16:09:41 +01:00
Lennart Poettering ea900d2bfe
Merge pull request #11009 from poettering/root-cgroup-again
tweak root cgroup attribute fiddling for cgroupsv1 again
2018-12-04 12:33:03 +01:00
Chris Down c72703e26d cgroup: Add DisableControllers= directive to disable controller in subtree
Some controllers (like the CPU controller) have a performance cost that
is non-trivial on certain workloads. While this can be mitigated and
improved to an extent, there will for some controllers always be some
overheads associated with the benefits gained from the controller.
Inside Facebook, the fix applied has been to disable the CPU controller
forcibly with `cgroup_disable=cpu` on the kernel command line.

This presents a problem: to disable or reenable the controller, a reboot
is required, but this is quite cumbersome and slow to do for many
thousands of machines, especially machines where disabling/enabling a
stateful service on a machine is a matter of several minutes.

Currently systemd provides some configuration knobs for these in the
form of `[Default]CPUAccounting`, `[Default]MemoryAccounting`, and the
like. The limitation of these is that Default*Accounting is overrideable
by individual services, of which any one could decide to reenable a
controller within the hierarchy at any point just by using a controller
feature implicitly (eg. `CPUWeight`), even if the use of that CPU
feature could just be opportunistic. Since many services are provided by
the distribution, or by upstream teams at a particular organisation,
it's not a sustainable solution to simply try to find and remove
offending directives from these units.

This commit presents a more direct solution -- a DisableControllers=
directive that forcibly disallows a controller from being enabled within
a subtree.
2018-12-03 15:40:31 +00:00
Chris Down 4f6f62e468 cgroup: Traverse leaves to realised cgroup to release controllers
This adds a depth-first version of unit_realize_cgroup_now which can
only do depth-first disabling of controllers, in preparation for the
DisableControllers= directive.
2018-12-03 14:37:39 +00:00
Chris Down a57669d290 cgroup: Rework unit_realize_cgroup_now to explicitly be breadth-first
systemd currently doesn't really expend much effort in disabling
controllers. unit_realize_cgroup_now *may* be able to disable a
controller in the basic case when using cgroup v2, but generally won't
manage as downstream dependents may still use it.

This code doesn't add any logic to fix that, but it starts the process
of moving to have a breadth-first version of unit_realize_cgroup_now for
enabling, and a depth-first version of unit_realize_cgroup_now for
disabling.
2018-12-03 14:37:39 +00:00
Chris Down 0d2d6fbf15 cgroup: Move attribute application into unit_create_cgroup
We always end up doing these together, so just colocate them and require
manager state for unit_create_cgroup.
2018-12-03 14:37:38 +00:00
Lennart Poettering 67e2ea1542 cgroup: suffix unit file settings with "=" in log output
Let's follow our recommendations from CODING_STYLE and suffix unit file
settings with "=" everywhere.
2018-12-01 12:57:51 +01:00
Lennart Poettering be2c032781 core: don't try to write CPU quota and memory limit cgroup attrs on root cgroup
In the kernel sources attempts to write to either are refused with
EINVAL. Not sure why these attributes are exported anyway on cgroupsv1,
but this means we really should ignore them altogether.

This simplifies our code as this means cgroupsv1 is more like cgroupsv2
in this regard.

Fixes: #10969
2018-12-01 12:57:51 +01:00
Lennart Poettering d5aecba6e0 cgroup: use device_path_parse_major_minor() also for block device paths
Not only when we populate the "devices" cgroup controller we need
major/minor numbers, but for the io/blkio one it's the same, hence let's
use the same logic for both.
2018-11-29 20:21:39 +01:00
Lennart Poettering 846b3bd61e stat-util: add new APIs device_path_make_{major_minor|canonical}() and device_path_parse_major_minor()
device_path_make_{major_minor|canonical}() generate device node paths
given a mode_t and a dev_t. We have similar code all over the place,
let's unify this in one place. The former will generate a "/dev/char/"
or "/dev/block/" path, and never go to disk. The latter then goes to disk
and resolves that path to the actual path of the device node.

device_path_parse_major_minor() reverses device_path_make_major_minor(),
also without going to disk.

We have similar code doing something like this at various places, let's
unify this in a single set of functions. This also allows us to teach
them special tricks, for example handling of the
/run/systemd/inaccessible/{blk|chr} device nodes, which we use for
masking device nodes, and which do not exist in /dev/char/* and
/dev/block/*
2018-11-29 20:21:39 +01:00
Lennart Poettering 8e8b5d2e6d cgroups: beef up DeviceAllow= syntax a bit
Previously we'd allow pattern expressions such as "char-input" to match
all input devices. Internally, this would look up the right major to
test in /proc/devices. With this commit the syntax is slightly extended:

- "char-*" can be used to match any kind of character device, and
  similarly "block-*". This expression would work previously already, but
  instead of actually installing a wildcard match it would install many
  individual matches for everything listed in /proc/devices.

- "char-<MAJOR>" with "<MAJOR>" being a numerical parameter works now
  too. This allows clients to install whitelist items by specifying the
  major directly.

The main reason to add these is to provide limited compat support for
clients that for some reason contain whitelists with major/minor numbers
(such as OCI containers).
2018-11-29 20:21:39 +01:00
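The three forms of pattern expression listed in 8e8b5d2e6d could be told apart roughly as follows. This is a sketch under assumptions, not the parser systemd uses: `DeviceSpec`, `SpecKind` and `sketch_parse_device_spec()` are hypothetical names, and the /proc/devices lookup for named groups such as "char-input" is deliberately left to the caller.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

typedef enum {
        SPEC_WILDCARD,  /* "char-*" or "block-*" */
        SPEC_MAJOR,     /* "char-<MAJOR>" or "block-<MAJOR>" */
        SPEC_NAME,      /* e.g. "char-input", to be resolved via /proc/devices */
} SpecKind;

typedef struct {
        char type;              /* 'c' for character, 'b' for block devices */
        SpecKind kind;
        unsigned major_number;  /* only meaningful for SPEC_MAJOR */
        const char *name;       /* only meaningful for SPEC_NAME, points into the input */
} DeviceSpec;

/* Hypothetical classifier for the pattern syntax; returns 0 on success
 * and a negative errno-style value on malformed input. */
static int sketch_parse_device_spec(const char *s, DeviceSpec *ret) {
        const char *rest;
        unsigned long v;
        char *end;

        assert(s);
        assert(ret);

        if (strncmp(s, "char-", 5) == 0) {
                ret->type = 'c';
                rest = s + 5;
        } else if (strncmp(s, "block-", 6) == 0) {
                ret->type = 'b';
                rest = s + 6;
        } else
                return -EINVAL;

        if (*rest == '\0')
                return -EINVAL;

        ret->major_number = 0;
        ret->name = NULL;

        if (strcmp(rest, "*") == 0) {
                ret->kind = SPEC_WILDCARD;
                return 0;
        }

        if (*rest >= '0' && *rest <= '9') {
                errno = 0;
                v = strtoul(rest, &end, 10);
                if (errno == 0 && *end == '\0' && v <= UINT_MAX) {
                        ret->kind = SPEC_MAJOR;
                        ret->major_number = (unsigned) v;
                        return 0;
                }
        }

        /* Anything else is treated as a group name. */
        ret->kind = SPEC_NAME;
        ret->name = rest;
        return 0;
}
```

A caller would install a single wildcard match for SPEC_WILDCARD, a single major match for SPEC_MAJOR, and fall back to the /proc/devices lookup only for SPEC_NAME.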
Lennart Poettering 74c48bf5a8 core: add special handling for devices cgroup allow lists for /dev/block/* and /dev/char/* device nodes
This adds some code to handle /dev/block/* and /dev/char/* device node
paths specially: instead of actually stat()ing them we'll just parse the
major/minor numbers from the name. This is a useful 'hack' to allow clients
to install whitelists for devices that don't actually have to exist.

Also, let's similarly handle /run/systemd/inaccessible/{blk|chr}. This
allows us to simplify our built-in default whitelist to not require a
"ignore_enoent" mode for these nodes.

In general we should be careful with hardcoding major/minor numbers, but
in this case this should be safe.
2018-11-29 20:03:56 +01:00
Lennart Poettering 5af8805872 cgroup: drastically simplify caching of cgroups members mask
Previously we tried to be smart: when a new unit appeared and it only
added controllers to the cgroup mask we'd update the cached members mask
in all parents by ORing in the controller flags in their cached values.
Unfortunately this was quite broken, as we missed some conditions when
this cache had to be reset (for example, when a unit got unloaded),
moreover the optimization doesn't work when a controller is removed
anyway (as in that case there's no other way for the parent than to
iterate through all children to check if any other, remaining child unit
still needs it).
Hence, let's simplify the logic substantially: instead of updating the
cache on the right events (which we didn't get right), let's simply
invalidate the cache, and generate it lazily when we encounter it later.
This should actually result in better behaviour as we don't have to
calculate the new members mask for a whole subtree whenever we have the
suspicion something changed, but can delay it to the point where we
actually need the members mask.

This allows us to simplify things quite a bit, which is good, since
validating this cache for correctness is hard enough.

Fixes: #9512
2018-11-23 13:41:37 +01:00
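The invalidate-then-recompute-lazily scheme from 5af8805872 can be sketched on a toy tree. This is an illustration only: `Node`, its fields and the `node_*` helpers are hypothetical and far simpler than systemd's Unit objects, and the "members mask" here is assumed to be the OR of the controller bits needed by all descendants.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CHILDREN 8

typedef struct Node Node;
struct Node {
        Node *parent;
        Node *children[MAX_CHILDREN];
        size_t n_children;
        unsigned own_mask;          /* controllers this node itself needs */
        unsigned members_mask;      /* cached OR over all descendants */
        bool members_mask_valid;    /* false means: recompute on next use */
};

static void node_add_child(Node *parent, Node *child) {
        assert(parent);
        assert(child);
        assert(parent->n_children < MAX_CHILDREN);

        parent->children[parent->n_children++] = child;
        child->parent = parent;
}

/* Instead of patching cached values in place, just drop the cache of the
 * node and of every ancestor. */
static void node_invalidate_members_mask(Node *n) {
        for (; n; n = n->parent)
                n->members_mask_valid = false;
}

/* Recompute only when somebody actually asks for the value. */
static unsigned node_get_members_mask(Node *n) {
        unsigned m = 0;

        assert(n);

        if (n->members_mask_valid)
                return n->members_mask;

        for (size_t i = 0; i < n->n_children; i++)
                m |= n->children[i]->own_mask | node_get_members_mask(n->children[i]);

        n->members_mask = m;
        n->members_mask_valid = true;
        return m;
}
```

Invalidation is a cheap walk up the parent chain, and it is correct for removed controllers as well as added ones, because the next reader recomputes the mask from the children rather than trusting an incrementally updated value.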
Lennart Poettering 8a0d538815 cgroup: extend comment on what unit_release_cgroup() is for 2018-11-23 13:41:37 +01:00
Lennart Poettering 1fd3a10c38 cgroup: extend reasons when we realize the enable mask
After creating a cgroup we need to initialize its
"cgroup.subtree_control" file with the controllers its children want to
use. Currently we do so whenever the mkdir() on the cgroup succeeded,
i.e. when we know the cgroup is "fresh". Let's update the condition
slightly so that we also do so when internally we assume a cgroup doesn't
exist yet, even if it already does (maybe left-over from a previous
run).

This shouldn't change anything IRL but make things a bit more robust.
2018-11-23 13:41:37 +01:00
Lennart Poettering d5095dcd30 cgroup: tighten call that detects whether we need to realize a unit's cgroup a bit, and comment why 2018-11-23 13:41:37 +01:00
Lennart Poettering 27c4ed790a cgroup: simplify check whether it makes sense to realize a cgroup 2018-11-23 13:41:37 +01:00
Lennart Poettering e00068e71f cgroup: in unit_invalidate_cgroup() actually modify invalidation mask
Previously this would manipulate the realization mask for invalidating
the realization. This is a bit ugly though as the realization mask's
primary purpose is to reflect in which hierarchies a cgroup currently
exists, and it's probably a good idea to keep that in sync with
realities.

We nowadays have an explicit field for invalidating cgroup
controller information, the "cgroup_invalidated_mask", let's use this
one instead.

The effect is pretty much the same, as the main consumer of these masks
(unit_has_mask_realize()) checks both anyway.
2018-11-23 13:41:37 +01:00
Lennart Poettering 27adcc9737 cgroup: be more careful with which controllers we can enable/disable on a cgroup
This changes cg_enable_everywhere() to return which controllers are
enabled for the specified cgroup. This information is then used to
correctly track the enablement mask currently in effect for a unit.
Moreover, when we try to turn off a controller, and this works, then
this indicates that the parent unit might successfully turn it off
now, too, as our unit might have kept it busy.

So far, when realizing cgroups, i.e. when syncing up the kernel
representation of relevant cgroups with our own idea we would strictly
work from the root to the leaves. This is generally a good approach, as
when controllers are enabled this has to happen in root-to-leaves order.
However, when controllers are disabled this has to happen in the
opposite order: in leaves-to-root order (this is because controllers can
only be enabled in a child if it is already enabled in the parent, and
if it shall be disabled in the parent then it has to be disabled in the
child first, otherwise it is considered busy when it is attempted to
remove it in the parent).

To make things complicated, when invalidating a unit's cgroup membership
systemd can actually turn off some controllers previously turned on at
the very same time as it turns on other controllers previously turned
off. In such a case we have to work up leaves-to-root *and*
root-to-leaves right after each other. With this patch this is
implemented: we still generally operate root-to-leaves, but as soon as
we notice we successfully turned off a controller previously turned on
for a cgroup we'll re-enqueue the cgroup realization for all parents of
a unit, thus implementing leaves-to-root where necessary.
2018-11-23 13:41:37 +01:00
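The ordering constraint explained in 27adcc9737 (enable root-to-leaves, disable leaves-to-root) can be demonstrated with a toy model. Everything here is an assumption made for illustration: `Cg` is a single parent/child chain rather than a real cgroup tree, the root is assumed to be able to enable any controller, and the -ENOENT/-EBUSY return values merely stand in for whatever the kernel reports.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef struct Cg Cg;
struct Cg {
        Cg *parent;
        Cg *child;          /* a single child keeps the model short */
        unsigned enabled;   /* bit mask of enabled controllers */
};

/* A controller can only be enabled if the parent already has it. */
static int cg_try_enable(Cg *c, unsigned bit) {
        assert(c);

        if (c->parent && !(c->parent->enabled & bit))
                return -ENOENT;

        c->enabled |= bit;
        return 0;
}

/* A controller cannot be disabled while the child still uses it. */
static int cg_try_disable(Cg *c, unsigned bit) {
        assert(c);

        if (c->child && (c->child->enabled & bit))
                return -EBUSY;

        c->enabled &= ~bit;
        return 0;
}

/* Enable on all ancestors first, then on c itself. */
static int cg_enable_root_to_leaf(Cg *c, unsigned bit) {
        int r;

        assert(c);

        if (c->parent) {
                r = cg_enable_root_to_leaf(c->parent, bit);
                if (r < 0)
                        return r;
        }

        return cg_try_enable(c, bit);
}

/* Disable on all descendants first, then on c itself. */
static int cg_disable_leaf_to_root(Cg *c, unsigned bit) {
        int r;

        assert(c);

        if (c->child) {
                r = cg_disable_leaf_to_root(c->child, bit);
                if (r < 0)
                        return r;
        }

        return cg_try_disable(c, bit);
}
```

In this model a direct attempt in the wrong order fails, which mirrors why the commit re-enqueues the parents of a unit once a controller has been turned off successfully further down.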
Lennart Poettering 26a17ca280 cgroup: add explanatory comment 2018-11-23 12:24:37 +01:00
Lennart Poettering 442ce7759c cgroup: units that aren't loaded properly should not result in cgroup controllers being pulled in
This shouldn't make much difference in real life, but is a bit cleaner.
2018-11-23 12:24:37 +01:00
Lennart Poettering 1649244588 cgroup: make unit_get_needs_bpf_firewall() static too 2018-11-23 12:24:37 +01:00
Lennart Poettering 53aea74a60 cgroup: make some functions static 2018-11-23 12:24:37 +01:00
Lennart Poettering 52fecf20b9 cgroup: fine tune when to apply cgroup attributes to the root cgroup
Let's tweak when precisely to apply cgroup attributes on the root
cgroup.

With this we now follow the following rules:

1. On cgroupsv2 we never apply any regular cgroup attributes to the host root,
   since the attributes generally do not exist there.

2. On cgroupsv1 we do not apply any "weight" or "shares" style
   attributes to the host root cgroup, since they don't make much sense
   on the top level where there's only one group, hence no need to
   compare weights against each other. The other attributes are applied
   to the host root cgroup however.

3. In any case we don't apply attributes to the root of container
   environments (and --user roots), under the assumption that this is
   managed by the manager further up. (Note that on cgroupsv2 this is
   even enforced by the kernel)

4. BPF pseudo-attributes are applied in all cases (since we can have as
   many of them as we want)
2018-11-23 12:24:37 +01:00
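The four rules listed in 52fecf20b9 can be condensed into one predicate. This is a sketch of the rules as worded in that commit message only, not the code in systemd: `AttrClass` and `sketch_apply_to_root_cgroup()` are hypothetical names, and later commits in this log (for example be2c032781) tighten the cgroupsv1 behaviour further.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
        ATTR_WEIGHT,  /* "weight" or "shares" style attributes */
        ATTR_OTHER,   /* any other regular cgroup attribute */
        ATTR_BPF,     /* BPF pseudo-attributes */
} AttrClass;

/* Decide whether an attribute of the given class is applied to a root
 * cgroup. "unified" means cgroupsv2; "host_root" is false for container
 * roots and --user roots. */
static bool sketch_apply_to_root_cgroup(bool unified, bool host_root, AttrClass c) {
        assert(c == ATTR_WEIGHT || c == ATTR_OTHER || c == ATTR_BPF);

        if (c == ATTR_BPF)
                return true;        /* rule 4: always applied */

        if (!host_root)
                return false;       /* rule 3: managed further up */

        if (unified)
                return false;       /* rule 1: attributes don't exist there */

        return c != ATTR_WEIGHT;    /* rule 2: no weights on the v1 host root */
}
```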
Lennart Poettering 589a5f7a38 cgroup: append \n to static strings we write to cgroup attributes
This is a bit cleaner since when we format numeric limits we append
it. And this way write_string_file() doesn't have to append it.
2018-11-23 12:24:37 +01:00
Lennart Poettering 28cfdc5aeb cgroup: tighten manager_owns_host_root_cgroup() a bit
This tightening is not strictly necessary (as the m->cgroup_root check
further down does the same), but let's make this explicit.
2018-11-23 12:24:37 +01:00
Lennart Poettering 611c4f8afb cgroup: rename {manager_owns|unit_has}_root_cgroup() → .._host_root_cgroup()
Let's emphasize that this function checks for the host root cgroup, i.e.
returns false for the root cgroup when we run in a container where
CLONE_NEWCGROUP is used. There has been some confusion around this
already, for example cgroup_context_apply() uses the function
incorrectly (which we'll fix in a later commit).

Just some refactoring, no change in behaviour.
2018-11-23 12:24:37 +01:00
Lennart Poettering 293d32df39 cgroup: add a common routine for writing to attributes, and logging about it
We can use this at quite a few places, and this allows us to shorten our
code quite a bit.
2018-11-23 12:24:37 +01:00
Lennart Poettering 39b9fefb2e cgroup: add a new macro for determining log level for cgroup attr write failures
For now, let's use it only at one place, but a follow-up commit will
make more use of it.
2018-11-23 12:24:37 +01:00
Lennart Poettering 2c74e12bb3 cgroup: ignore EPERM for a couple of more attribute writes 2018-11-23 12:24:37 +01:00
Lennart Poettering 8c83840772 cgroup: add comment explaining why we ignore EINVAL at two places
These are just copies from further down.
2018-11-23 12:24:37 +01:00
Lennart Poettering 73fe5314bf cgroup: suffix settings with "=" in log messages where appropriate 2018-11-23 12:24:37 +01:00
Lennart Poettering a0c339ed4b cgroup: only install cgroup release agent when we own the root cgroup
If we run in a container we shouldn't patch around this, and most likely
we can't anyway, and there's not much point in complaining about this.
Hence let's strictly say: the agent is private property of the host's
system instance, nothing else.
2018-11-23 12:24:37 +01:00
Lennart Poettering de8a711a58 cgroup: use structured initialization 2018-11-23 12:24:37 +01:00
Chris Down f98c25850f cgroup v2: Don't require CPU controller for CPU accounting in 4.15+
systemd only uses functions that are as of Linux 4.15+ provided
externally to the CPU controller (currently usage_usec), so if we have a
new enough kernel, we don't need to set CGROUP_MASK_CPU for
CPUAccounting=true as the CPU controller does not need to necessarily be
enabled in this case.

Part of this patch is modelled on an earlier patch by Ryutaroh Matsumoto
(see PR #9665).
2018-11-18 12:21:41 +00:00
Lennart Poettering fae9bc298a cgroup: when determining which controllers we need, always extend the mask according to cpu/cpuacct joint mounting
Note that for cgroup_context_get_mask() this doesn't actually change
much, but it does prepare the ground for #10507 later on.
2018-11-16 14:54:13 +01:00
Lennart Poettering 8d33dca2ff core: fix capitalization of CPUShares= settings 2018-11-16 14:46:49 +01:00
Lennart Poettering c2baf11c36 cgroup: actually reset the cgroup invalidation mask after we made our changes
Previously we never unmasked the mask after it was set once. Let's fix
that.
2018-11-08 15:20:52 +01:00
Yu Watanabe 5e1ee764e1 core: include error cause in log message 2018-10-20 01:40:42 +09:00
Lennart Poettering 490c5a37cb tree-wide: some automatic coccinelle fixes (#10463)
Nothing fancy, just coccinelle doing its work.
2018-10-20 00:07:46 +09:00
Lennart Poettering c66e60a838 cgroup: FOREACH_LINE exorcism 2018-10-18 16:23:45 +02:00
Lennart Poettering 913c898ca0 cgroup: voidify a few things 2018-10-13 12:37:13 +02:00
Lennart Poettering b9839ac9d9 cgroup: make sure whitelist_device() always returns a valid return value
CID 1396094
2018-10-13 12:37:13 +02:00
Zbigniew Jędrzejewski-Szmek f436470ae1
Merge pull request #10343 from poettering/manager-state-fix
various fixes for PID1's Manager object
2018-10-10 12:36:16 +02:00
Lennart Poettering 638cece45d core: clean up test run flags
Let's make them typesafe, and let's add a nice macro helper for checking
if we are in a test run, which should make testing for this much easier
to read for most cases.
2018-10-09 19:43:43 +02:00
Roman Gushchin 084c700780 core: support cgroup v2 device controller
Cgroup v2 provides the eBPF-based device controller, which isn't currently
supported by systemd. This commit aims to provide such support.

There are no user-visible changes, just the device policy and whitelist
start working if cgroup v2 is used.
2018-10-09 09:47:51 -07:00
Roman Gushchin 17f149556a core: refactor bpf firewall support into a pseudo-controller
The idea is to introduce a concept of bpf-based pseudo-controllers
to make adding new bpf-based features easier.
2018-10-09 09:46:08 -07:00
Tejun Heo 6ae4283cb1 core: add IODeviceLatencyTargetSec
This adds support for the following proposed latency based IO control
mechanism.

  https://lkml.org/lkml/2018/6/5/428
2018-08-22 16:46:18 +02:00
Yu Watanabe fd870bac25 core: introduce cgroup_add_device_allow() 2018-08-06 13:42:14 +09:00