Commit Graph

340 Commits

Author SHA1 Message Date
Michal Sekletar d910f4c2b2 core/cgroup: fix return value of unit_cgorup_freezer_action()
We should return 0 only if current freezer state, as reported by the
kernel, is already the desired state. Otherwise, we would dispatch
return dbus message prematurely in bus_unit_method_freezer_generic().

Thanks to Frantisek Sumsal for reporting the issue.
2020-05-07 22:19:19 +02:00
Michal Sekletár d9e45bc3ab core: introduce support for cgroup freezer
With cgroup v2 the cgroup freezer is implemented as a cgroup
attribute called cgroup.freeze. cgroup can be frozen by writing "1"
to the file and kernel will send us a notification through
"cgroup.events" after the operation is finished and processes in the
cgroup entered quiescent state, i.e. they are not scheduled to
run. Writing "0" to the attribute file does the inverse and process
execution is resumed.

This commit exposes above low-level functionality through systemd's DBus
API. Each unit type must provide specialized implementation for these
methods, otherwise, we return an error. So far only service, scope, and
slice unit types provide the support. It is possible to check if a
given unit has the support using CanFreeze() DBus property.

Note that DBus API has a synchronous behavior and we dispatch the reply
to freeze/thaw requests only after the kernel has notified us that
requested operation was completed.
2020-04-30 19:02:51 +02:00
Anita Zhang 613328c3e2 cgroup-util: helper to cg_get_attribute and convert to uint64_t
A common pattern in the codebase is reading a cgroup memory value
and converting it to a uint64_t. Let's make it a helper and refactor a
few places to use it so it's more concise.
2020-03-24 16:05:16 -07:00
Benjamin Berg b7cf4b4ef5 core: Fix resolution of nested DM devices for cgroups
When using the cgroups IO controller, the device that is controlled
should always be the toplevel block device. This did not get resolved
correctly for an LVM volume inside a LUKS device, because the code would
only resolve one level of indirection.

Fix this by recursively looking up the originating block device for DM
devices.

Resolves: #15008
2020-03-06 16:11:44 +00:00
Lennart Poettering c238a2f889 cgroup: minor comment improvement
As pointed out here:

https://github.com/systemd/systemd/pull/14564#discussion_r366305882
2020-01-14 16:57:51 +01:00
Lennart Poettering 48fd01e5f3 cgroup: drop redundant if check 2020-01-14 10:44:58 +01:00
Lennart Poettering e1e98911a8 cgroup: update only siblings that got realized once
Fixes: #14475
Replaces: #14554
2020-01-14 10:44:19 +01:00
Lennart Poettering 95ae4d1420 cgroup: drop unnecessary {} 2020-01-14 10:44:19 +01:00
Lennart Poettering a0d6590c4e cgroup: no need to cast dev_t to dev_t 2020-01-14 10:44:19 +01:00
Lennart Poettering 57f1030b13 cgroup: use log_warning_errno() where possible 2020-01-14 10:44:19 +01:00
Lennart Poettering 65f6b6bdcb core: fix re-realization of cgroup siblings
This is a fix-up for eef85c4a3f which
broke this.

Tracked down by @w-simon

Fixes: #14453
2020-01-09 17:31:41 +01:00
Lennart Poettering 3288ea8f32 core: set "trusted.delegate" xattr on cgroups that are delegation boundaries
Let's mark cgroups that are delegation boundaries to us. This can then
be used by tools such as "systemd-cgls" to show where the next manager
takes over.
2019-11-20 17:50:12 +01:00
Zbigniew Jędrzejewski-Szmek 3a0f06c41a core: make TasksMax a partially dynamic property
TasksMax= and DefaultTasksMax= can be specified as percentages. We don't
actually document of what the percentage is relative to, but the implementation
uses the smallest of /proc/sys/kernel/pid_max, /proc/sys/kernel/threads-max,
and /sys/fs/cgroup/pids.max (when present). When the value is a percentage,
we immediately convert it to an absolute value. If the limit later changes
(which can happen e.g. when systemd-sysctl runs), the absolute value becomes
outdated.

So let's store either the percentage or absolute value, whatever was specified,
and only convert to an absolute value when the value is used. For example, when
starting a unit, the absolute value will be calculated when the cgroup for
the unit is created.

Fixes #13419.
2019-11-14 18:41:54 +01:00
Zbigniew Jędrzejewski-Szmek 45669ae264 bpf: make sure the kernel do not submit an invalid program if no pattern matched
It turns out that the kernel verifier would reject a program we would build
if there was a whitelist, but no entries in the whitelist matched.
The program would approximately like this:
   0: (61) r2 = *(u32 *)(r1 +0)
   1: (54) w2 &= 65535
   2: (61) r3 = *(u32 *)(r1 +0)
   3: (74) w3 >>= 16
   4: (61) r4 = *(u32 *)(r1 +4)
   5: (61) r5 = *(u32 *)(r1 +8)
  48: (b7) r0 = 0
  49: (05) goto pc+1
  50: (b7) r0 = 1
  51: (95) exit
and insn 50 is unreachable, which is illegal. We would then either keep a
previous version of the program or allow everything. Make sure we build a
valid program that simply rejects everything.
2019-11-11 15:14:09 +01:00
Zbigniew Jędrzejewski-Szmek 0848715cab bpf: make bpf_devices_apply_policy() independent of any unit code 2019-11-11 14:55:57 +01:00
Zbigniew Jędrzejewski-Szmek 8b139557fe core: split out one more function 2019-11-11 14:55:52 +01:00
Zbigniew Jędrzejewski-Szmek a9aac7d8dd core: also split out helper to handle static device nodes 2019-11-10 23:22:15 +01:00
Zbigniew Jędrzejewski-Szmek 124e05b3b6 core: move bpf devices implementation to bpf-devices.[ch] and rename
The naming of the functions was a complete mess: the most specific functions
which don't know anything about cgroups had "cgroup_" prefix, while more
general functions which took a node path and a cgroup for reporting had no
prefix. Let's use "bpf_devices_" for the latter group, and "bpf_prog_*" for the
rest.

The main goal of this move is to split the implementation from the calling code
and add unit tests in a later patch.
2019-11-10 23:22:15 +01:00
Zbigniew Jędrzejewski-Szmek 084870f9c0 core: rename CGROUP_AUTO/STRICT/CLOSED to CGROUP_DEVICE_POLICY_…
The old names were very generic, and when used without context it wasn't at all
clear that they are about the devices policy.
2019-11-10 23:22:15 +01:00
Zbigniew Jędrzejewski-Szmek 672cbcbc20 bpf: return normally from whitelist_major()
All callers do (void) anyway, so we can just use normal return here.
2019-11-10 23:22:15 +01:00
Zbigniew Jędrzejewski-Szmek d49c180826 bpf: do not bother adding device patterns after whitelisting the full class
This seems to have been unintentional.
2019-11-10 23:22:15 +01:00
Zbigniew Jędrzejewski-Szmek fa6613fc53 bpf: refactor how we create device major:minor whitelists
No functional change intended except for minor adjustments to error messages.
2019-11-10 23:22:15 +01:00
Lennart Poettering c259ac9aa2 cpuset: fix indentation and log about OOM we otherwise ignore 2019-11-01 10:21:53 +01:00
Lennart Poettering 85c3b27891 cgroup: add some basic OOM safety where it was missing 2019-11-01 10:21:35 +01:00
Zbigniew Jędrzejewski-Szmek 2cea199ec1 core: pass around pointer, not struct
Since this is a static function, the compiler is likely to optimize it away
anyway, but let's do the normal thing here.
2019-10-11 13:46:05 +02:00
Chris Down bc0623df16 cgroup: analyze: Report memory configurations that deviate from systemd
This is the most basic consumer of the new systemd-vs-kernel checker,
both acting as a reasonable standalone exerciser of the code, and also
as a way for easy inspection of deviations from systemd internal state.
2019-10-03 15:06:25 +01:00
Chris Down 6dfb92823f cgroup: analyze: Match standard dump format
We're the only ones left using = as the delimiter, which looks really
weird in `systemd-analyze dump`. Use `: ` like everyone else.
2019-10-03 15:06:25 +01:00
Chris Down 74b5fb272f cgroup: Allow checking systemd-internal limits against the kernel
We currently don't have any mitigations against another privileged user
on the system messing with the cgroup hierarchy, bringing the system out
of line with what we've set in systemd. We also don't have any real way
to surface this to the user (we do have logs, but you have to know to
look in the first place).

There are a few possible solutions:

1. Maintaining our own cgroup tree with the new fsopen API and having a
   read-only copy for everyone else. However, there are some
   complications on this front, and this may be infeasible in some
   environments. I'd rate this as a longer term effort that's tangential
   to this patch.
2. Actively checking for changes with {fa,i}notify and changing them
   back afterwards to match our configuration again. This is also
   possible, but it's also good to have a way to do passive monitoring
   of the situation without taking hard action. Also, currently daemons
   like senpai do actually need to modify the tree behind systemd's
   back (although hopefully this should be more integrated soon).

This patch implements another option, where one can, on demand, monitor
deviations in cgroup memory configuration from systemd's internal state.
Currently the only consumer is `systemd-analyze dump`, but the interface
is generic enough that it can also be exposed elsewhere later (for
example, over D-Bus).

Currently only memory limit style properties are supported, but later I
also plan to expand this out to other properties that systemd should
have ultimate control over.
2019-10-03 15:06:25 +01:00
Zbigniew Jędrzejewski-Szmek 86e94d95d0
Merge pull request #13246 from keszybz/add-SystemdOptions-efi-variable
Add efi variable to augment /proc/cmdline
2019-10-03 12:19:44 +02:00
Chris Down 64fe532e90 cgroup: Respect DefaultMemoryMin when setting memory.min
This is an oversight from https://github.com/systemd/systemd/pull/12332.

Sadly the tests didn't catch it since it requires a real cgroup
hierarchy to see, and it wasn't seen in prod since we're only currently
using DefaultMemoryLow, not DefaultMemoryMin. :-(
2019-09-30 18:41:21 +01:00
Chris Down 7c9d2b7993 cgroup: Check ancestor memory min for unified memory config
Otherwise we might not enable it when we should, ie. DefaultMemoryMin is
set in a parent, but not MemoryMin in the current unit.
2019-09-30 18:24:26 +01:00
Pavel Hrdina 047f5d63d7 cgroup: introduce support for cgroup v2 CPUSET controller
Introduce support for configuring cpus and mems for processes using
cgroup v2 CPUSET controller.  This allows users to limit which cpus
and memory NUMA nodes can be used by processes to better utilize
system resources.

The cgroup v2 interfaces to control it are cpuset.cpus and cpuset.mems
where the requested configuration is written.  However, it doesn't mean
that the requested configuration will be actually used as parent cgroup
may limit the cpus or mems as well.  In order to reflect the real
configuration cgroup v2 provides read-only files cpuset.cpus.effective
and cpuset.mems.effective which are exported to users as well.
2019-09-24 15:16:07 +02:00
Zbigniew Jędrzejewski-Szmek fdb3decaa7 util-lib: move some functions from basic/cgroup-util to shared/cgroup-setup
This way less stuff needs to be in basic. Initially, I wanted to move all the
parts of cgroup-utils.[ch] that depend on efivars.[ch] to shared, because
efivars.[ch] is in shared/. Later on, I decide to split efivars.[ch], so the
move done in this patch is not necessary anymore. Nevertheless, it is still
valid on its own. If at some point we want to expose libbasic, it is better to
to not have stuff that belong in libshared there.
2019-09-16 18:08:00 +02:00
Zbigniew Jędrzejewski-Szmek d4d99bc6e4 basic/cgroup-util: let cgroup_unified_flush() return the detected hierarchy
This avoid the use of the global variable.

Also rename cgroup_unified_update() to cgroup_unified_cached() and
cgroup_unified_flush() to cgroup_unified() to better reflect their new roles.
2019-09-16 18:06:20 +02:00
Kai Krakow 2dbc45aea7 cgroup: Also set io.bfq.weight
Current kernels with BFQ scheduler do not yet set their IO weight
through "io.weight" but through "io.bfq.weight" (using a slightly
different interface supporting only default weights, not per-device
weights). This commit enables "IOWeight=" to just to that.

This patch may be dropped at some time later.

Github-Link: https://github.com/systemd/systemd/issues/7057
Signed-off-by: Kai Krakow <kai@kaishome.de>
2019-08-20 11:50:59 +02:00
Zbigniew Jędrzejewski-Szmek a505166845
Merge pull request #13096 from keszybz/unit-loading
Preparatory work for the unit loading rework
2019-07-19 21:47:10 +02:00
Zbigniew Jędrzejewski-Szmek f4c43a8115 pid1: do not say "(null)" if no disabled controllers
It looks like we made a mistake. The list is just empty, that's all.
2019-07-19 16:51:14 +02:00
Zbigniew Jędrzejewski-Szmek 217b7b33cc pid1: order jobs that execute processes with lower priority
We can meaningfully compare jobs for units which have cpu weight or nice set.
But non-exec units those have those set.

Starting non-exec jobs first allows us to get them out of the queue quickly,
and consider more jobs for starting.

If we have service A, and socket B, and service C which is after socket B,
and we want to start both A and C, and C has higher cpu weight, if we get
B out of the way first, we'll know that we can start both A and C, and we'll
start C first.

Also invert the comparisons using CMP() so they are always done left vs. right,
and negate when returning instead.

Follow-up for da8e178296.
2019-07-19 14:38:52 +09:00
Michael Olbrich da8e178296 job: make the run queue order deterministic
Jobs are added to the run queue in random order. This happens because most
jobs are added by iterating over the transaction or dependency hash maps.

As a result, jobs that can be executed at the same time are started in a
different order each time.
On small embedded devices this can cause a measurable jitter for the point
in time when a job starts (~100ms jitter for 10 units that are started in
random order).
This results is a similar jitter for the boot time. This is undesirable in
general and make optimizing the boot time a lot harder.
Also, jobs that should have a higher priority because the unit has a higher
CPU weight might get executed later than others.

Fix this by turning the job run_queue into a Prioq and sort by the
following criteria (use the next if the values are equal):
- CPU weight
- nice level
- unit type
- unit name

The last one is just there for deterministic sorting to avoid any jitter.
2019-07-18 10:28:39 +02:00
Zbigniew Jędrzejewski-Szmek 95b21cff0e Apply empty_to_root() in three more spots for safety 2019-07-15 18:39:26 +02:00
Kai Lüke fab347489f bpf-firewall: custom BPF programs through IP(Ingress|Egress)FilterPath=
Takes a single /sys/fs/bpf/pinned_prog string as argument, but may be
specified multiple times. An empty assignment resets all previous filters.

Closes https://github.com/systemd/systemd/issues/10227
2019-06-25 09:56:16 +02:00
Yu Watanabe 270384b2d4 tree-wide: replace strjoina() with prefix_roota() 2019-06-25 01:31:26 +09:00
Lennart Poettering cee97d5768
Merge pull request #12836 from yuwata/tree-wide-replace-strjoin
tree-wide: replace strjoin() with path_join()
2019-06-22 20:02:46 +02:00
Yu Watanabe 657ee2d82b tree-wide: replace strjoin() with path_join() 2019-06-21 03:26:16 +09:00
Donald Buczek 0219b3524f cgroup: Continue unit reset if cgroup is busy
When part of the cgroup hierarchy cannot be deleted (e.g. because there
are still processes in it), do not exit unit_prune_cgroup early, but
continue so that u->cgroup_realized is reset.

Log the known case of non-empty cgroups at debug level and other errors
at warning level.

Fixes https://github.com/systemd/systemd/issues/12386
2019-06-20 10:16:53 +02:00
Chris Down c710d3b430 cgroup: Prevent theoretical nullptr deref in unit mask calculation 2019-06-07 06:33:53 +01:00
Zbigniew Jędrzejewski-Szmek 84d2744bc5 Move warning about unsupported BPF firewall right before the firewall would be created
There's no need to warn about the firewall when parsing, because the unit might
not be started at all. Let's warn only when we're actually preparing to start
the firewall.

This changes behaviour:
- the warning is printed just once for all unit types, and not once
  for normal units and once for transient units.
- on repeat warnings, the message is not printed at all. There's already
  detailed debug info from bpf_firewall_compile(), so we don't need to repeat
  ourselves.
- when we are not root, let's say precisely that, not "lack of necessary privileges"
  and "the local system does not support BPF/cgroup firewalling".

Fixes #12673.
2019-06-04 17:22:37 +02:00
Zbigniew Jędrzejewski-Szmek 2ba6ae6b2b core: do an extra check if oom was triggered when handling sigchild
Should fix #12425.
2019-05-20 16:37:06 +02:00
Ben Boeckel 5238e95759 codespell: fix spelling errors 2019-04-29 16:47:18 +02:00
Lennart Poettering d8974757c4
Merge pull request #12407 from keszybz/two-unrelated-cleanups
Two unrelated cleanups
2019-04-26 23:43:27 +02:00