Commit graph

354 commits

Author SHA1 Message Date
Lennart Poettering 065b47749d tree-wide: use ERRNO_IS_PRIVILEGE() whereever appropriate 2020-09-22 16:25:22 +02:00
Lennart Poettering 244d9793ee
Merge pull request #16984 from yuwata/make-log_xxx_error-void
Make log_xxx_error() or friends return void
2020-09-09 16:28:51 +02:00
Michal Sekletár 9a1e90aee5 cgroup: freezer action must be NOP when cgroup v2 freezer is not available
Low-level cgroup freezer state manipulation is invoked directly from the
job engine when we are about to execute the job in order to make sure
the unit is not frozen and job execution is not blocked because of
that.

Currently with cgroup v1 we would needlessly do a bunch of work in the
function and even falsely update the freezer state. Don't do any of this
and skip the function silently when v2 freezer is not available.

Following bug is fixed by this commit,

$ systemd-run --unit foo.service /bin/sleep infinity
$ systemctl restart foo.service
$ systemctl show -p FreezerState foo.service

Before (cgroup v1, i.e. full "legacy" mode):
FreezerState=thawing

After:
FreezerState=running
2020-09-08 19:54:13 +02:00
Yu Watanabe 8ed6f81ba3 core: make log_unit_error() or friends return void 2020-09-09 02:34:38 +09:00
Zbigniew Jędrzejewski-Szmek 90e74a66e6 tree-wide: define iterator inside of the macro 2020-09-08 12:14:05 +02:00
Michal Koutný d9ef594454 cgroup: Cleanup function usage
Some masks shouldn't be needed externally, so keep their functions in
the module (others would fit there too but they're used in tests) to
think twice if something would depend on them.

Drop unused function cg_attach_many_everywhere.

Use cgroup_realized instead of cgroup_path when we actually ask for
realized.

This should not cause any functional changes.
2020-08-19 11:41:53 +02:00
Michal Koutný 12b975e065 cgroup: Reduce unit_get_ancestor_disable_mask use
The usage in unit_get_own_mask is redundant, we only need apply
disable_mask at the end befor application, i.e. calculating enable or
target mask.

(IOW, we allow all configurations, but disabling affects effective
controls.)

Modify tests accordingly and add testing of enable mask.

This is intended as cleanup, with no effect but changing unit_dump
output.
2020-08-19 11:41:53 +02:00
Michal Koutný 4c591f3996 cgroup: Introduce family queueing instead of siblings
The unit_add_siblings_to_cgroup_realize_queue does more than mere
siblings queueing, hence define a family of a unit as (immediate)
children of the unit and immediate children of all ancestors.

Working with this abstraction simplifies the queuing calls and it
shouldn't change the functionality.
2020-08-19 11:41:53 +02:00
Michal Koutný f23ba94db3 cgroup: Implicit unit_invalidate_cgroup_members_masks
Merge members mask invalidation into
unit_add_siblings_to_cgroup_realize_queue, this way unit_realize_cgroup
needn't be called with members mask invalidation.

We have to retain the members mask invalidation in unit_load -- although
active units would have cgroups (re)realized (unit_load queues for
realization), the realization would happen with potentially stale mask.
2020-08-19 11:41:53 +02:00
Michal Koutný fb46fca7e0 cgroup: Eager realization in unit_free
unit_free(u) realizes direct parent and invalidates members mask of all
ancestors. This isn't sufficient in v1 controller hierarchies since
siblings of the freed unit may have existed only because of the removed
unit.

We cannot be lazy about the siblings because if parent(u) is also
removed, it'd migrate and rmdir cgroups for siblings(u). However,
realized masks of siblings(u) won't reflect this change.

This was a non-issue earlier, because we weren't removing cgroup
directories properly (effectively matching the stale realized mask),
removal failed because of tasks left by missing migration (see previous
commit).

Therefore, ensure realization of all units necessary to clean up after
the free'd unit.

Fixes: #14149
2020-08-19 11:41:53 +02:00
Michal Koutný 7b63961415 cgroup: Swap cgroup v1 deletion and migration
When we are about to derealize a controller on v1 cgroup, we first
attempt to delete the controller cgroup and migrate afterwards. This
doesn't work in practice because populated cgroup cannot be deleted.
Furthermore, we leave out slices from migration completely, so
(un)setting a control value on them won't realize their controller
cgroup.

Rework actual realization, unit_create_cgroup() becomes
unit_update_cgroup() and make sure that controller hierarchies are
reduced when given controller cgroup ceased to be needed.

Note that with this we introduce slight deviation between v1 and v2 code
-- when a descendant unit turns off a delegated controller, we attempt
to disable it in ancestor slices. On v2 this may fail (kernel enforced,
because of child cgroups using the controller), on v1 we'll migrate
whole subtree and trim the subhierachy. (Previously, we wouldn't take
away delegated controller, however, derealization was broken anyway.)

Fixes: #14149
2020-08-19 11:41:53 +02:00
Michal Koutný 30ad3ca086 cgroup: Add root slice to cgroup realization queue
When we're disabling controller on a direct child of root cgroup, we
forgot to add root slice into cgroup realization queue, which prevented
proper disabling of the controller (on unified hierarchy).

The mechanism relying on "bounce from bottom and propagate up" in
unit_create_cgroup doesn't work on unified hierarchy (leaves needn't be
enabled). Drop it as we rely on the ancestors to be queued -- that's now
intentional but was artifact of combining the two patches:

cb5e3bc37d ("cgroup: Don't explicitly check for member in UNIT_BEFORE") v240~78
65f6b6bdcb ("core: fix re-realization of cgroup siblings") v245-rc1~153^2

Fixes: #14917
2020-07-28 15:49:24 +02:00
Michal Koutný a479c21ed2 cgroup: Make realize_queue behave FIFO
The current implementation is LIFO, which is a) confusing b) prevents
some ordered operations on the cgroup tree (e.g. removing children
before parents).

Fix it quickly. Current list implementation turns this from O(1) to O(n)
operation. Rework the lists later.
2020-07-28 15:49:24 +02:00
Lennart Poettering 6b000af4f2 tree-wide: avoid some loaded terms
https://tools.ietf.org/html/draft-knodel-terminology-02
https://lwn.net/Articles/823224/

This gets rid of most but not occasions of these loaded terms:

1. scsi_id and friends are something that is supposed to be removed from
   our tree (see #7594)

2. The test suite defines an API used by the ubuntu CI. We can remove
   this too later, but this needs to be done in sync with the ubuntu CI.

3. In some cases the terms are part of APIs we call or where we expose
   concepts the kernel names the way it names them. (In particular all
   remaining uses of the word "slave" in our codebase are like this,
   it's used by the POSIX PTY layer, by the network subsystem, the mount
   API and the block device subsystem). Getting rid of the term in these
   contexts would mean doing some major fixes of the kernel ABI first.

Regarding the replacements: when whitelist/blacklist is used as noun we
replace with with allow list/deny list, and when used as verb with
allow-list/deny-list.
2020-06-25 09:00:19 +02:00
Michal Sekletar d910f4c2b2 core/cgroup: fix return value of unit_cgorup_freezer_action()
We should return 0 only if current freezer state, as reported by the
kernel, is already the desired state. Otherwise, we would dispatch
return dbus message prematurely in bus_unit_method_freezer_generic().

Thanks to Frantisek Sumsal for reporting the issue.
2020-05-07 22:19:19 +02:00
Michal Sekletár d9e45bc3ab core: introduce support for cgroup freezer
With cgroup v2 the cgroup freezer is implemented as a cgroup
attribute called cgroup.freeze. cgroup can be frozen by writing "1"
to the file and kernel will send us a notification through
"cgroup.events" after the operation is finished and processes in the
cgroup entered quiescent state, i.e. they are not scheduled to
run. Writing "0" to the attribute file does the inverse and process
execution is resumed.

This commit exposes above low-level functionality through systemd's DBus
API. Each unit type must provide specialized implementation for these
methods, otherwise, we return an error. So far only service, scope, and
slice unit types provide the support. It is possible to check if a
given unit has the support using CanFreeze() DBus property.

Note that DBus API has a synchronous behavior and we dispatch the reply
to freeze/thaw requests only after the kernel has notified us that
requested operation was completed.
2020-04-30 19:02:51 +02:00
Anita Zhang 613328c3e2 cgroup-util: helper to cg_get_attribute and convert to uint64_t
A common pattern in the codebase is reading a cgroup memory value
and converting it to a uint64_t. Let's make it a helper and refactor a
few places to use it so it's more concise.
2020-03-24 16:05:16 -07:00
Benjamin Berg b7cf4b4ef5 core: Fix resolution of nested DM devices for cgroups
When using the cgroups IO controller, the device that is controlled
should always be the toplevel block device. This did not get resolved
correctly for an LVM volume inside a LUKS device, because the code would
only resolve one level of indirection.

Fix this by recursively looking up the originating block device for DM
devices.

Resolves: #15008
2020-03-06 16:11:44 +00:00
Lennart Poettering c238a2f889 cgroup: minor comment improvement
As pointed out here:

https://github.com/systemd/systemd/pull/14564#discussion_r366305882
2020-01-14 16:57:51 +01:00
Lennart Poettering 48fd01e5f3 cgroup: drop redundant if check 2020-01-14 10:44:58 +01:00
Lennart Poettering e1e98911a8 cgroup: update only siblings that got realized once
Fixes: #14475
Replaces: #14554
2020-01-14 10:44:19 +01:00
Lennart Poettering 95ae4d1420 cgroup: drop unnecessary {} 2020-01-14 10:44:19 +01:00
Lennart Poettering a0d6590c4e cgroup: no need to cast dev_t to dev_t 2020-01-14 10:44:19 +01:00
Lennart Poettering 57f1030b13 cgroup: use log_warning_errno() where possible 2020-01-14 10:44:19 +01:00
Lennart Poettering 65f6b6bdcb core: fix re-realization of cgroup siblings
This is a fix-up for eef85c4a3f which
broke this.

Tracked down by @w-simon

Fixes: #14453
2020-01-09 17:31:41 +01:00
Lennart Poettering 3288ea8f32 core: set "trusted.delegate" xattr on cgroups that are delegation boundaries
Let's mark cgroups that are delegation boundaries to us. This can then
be used by tools such as "systemd-cgls" to show where the next manager
takes over.
2019-11-20 17:50:12 +01:00
Zbigniew Jędrzejewski-Szmek 3a0f06c41a core: make TasksMax a partially dynamic property
TasksMax= and DefaultTasksMax= can be specified as percentages. We don't
actually document of what the percentage is relative to, but the implementation
uses the smallest of /proc/sys/kernel/pid_max, /proc/sys/kernel/threads-max,
and /sys/fs/cgroup/pids.max (when present). When the value is a percentage,
we immediately convert it to an absolute value. If the limit later changes
(which can happen e.g. when systemd-sysctl runs), the absolute value becomes
outdated.

So let's store either the percentage or absolute value, whatever was specified,
and only convert to an absolute value when the value is used. For example, when
starting a unit, the absolute value will be calculated when the cgroup for
the unit is created.

Fixes #13419.
2019-11-14 18:41:54 +01:00
Zbigniew Jędrzejewski-Szmek 45669ae264 bpf: make sure the kernel do not submit an invalid program if no pattern matched
It turns out that the kernel verifier would reject a program we would build
if there was a whitelist, but no entries in the whitelist matched.
The program would approximately like this:
   0: (61) r2 = *(u32 *)(r1 +0)
   1: (54) w2 &= 65535
   2: (61) r3 = *(u32 *)(r1 +0)
   3: (74) w3 >>= 16
   4: (61) r4 = *(u32 *)(r1 +4)
   5: (61) r5 = *(u32 *)(r1 +8)
  48: (b7) r0 = 0
  49: (05) goto pc+1
  50: (b7) r0 = 1
  51: (95) exit
and insn 50 is unreachable, which is illegal. We would then either keep a
previous version of the program or allow everything. Make sure we build a
valid program that simply rejects everything.
2019-11-11 15:14:09 +01:00
Zbigniew Jędrzejewski-Szmek 0848715cab bpf: make bpf_devices_apply_policy() independent of any unit code 2019-11-11 14:55:57 +01:00
Zbigniew Jędrzejewski-Szmek 8b139557fe core: split out one more function 2019-11-11 14:55:52 +01:00
Zbigniew Jędrzejewski-Szmek a9aac7d8dd core: also split out helper to handle static device nodes 2019-11-10 23:22:15 +01:00
Zbigniew Jędrzejewski-Szmek 124e05b3b6 core: move bpf devices implementation to bpf-devices.[ch] and rename
The naming of the functions was a complete mess: the most specific functions
which don't know anything about cgroups had "cgroup_" prefix, while more
general functions which took a node path and a cgroup for reporting had no
prefix. Let's use "bpf_devices_" for the latter group, and "bpf_prog_*" for the
rest.

The main goal of this move is to split the implementation from the calling code
and add unit tests in a later patch.
2019-11-10 23:22:15 +01:00
Zbigniew Jędrzejewski-Szmek 084870f9c0 core: rename CGROUP_AUTO/STRICT/CLOSED to CGROUP_DEVICE_POLICY_…
The old names were very generic, and when used without context it wasn't at all
clear that they are about the devices policy.
2019-11-10 23:22:15 +01:00
Zbigniew Jędrzejewski-Szmek 672cbcbc20 bpf: return normally from whitelist_major()
All callers do (void) anyway, so we can just use normal return here.
2019-11-10 23:22:15 +01:00
Zbigniew Jędrzejewski-Szmek d49c180826 bpf: do not bother adding device patterns after whitelisting the full class
This seems to have been unintentional.
2019-11-10 23:22:15 +01:00
Zbigniew Jędrzejewski-Szmek fa6613fc53 bpf: refactor how we create device major:minor whitelists
No functional change intended except for minor adjustments to error messages.
2019-11-10 23:22:15 +01:00
Lennart Poettering c259ac9aa2 cpuset: fix indentation and log about OOM we otherwise ignore 2019-11-01 10:21:53 +01:00
Lennart Poettering 85c3b27891 cgroup: add some basic OOM safety where it was missing 2019-11-01 10:21:35 +01:00
Zbigniew Jędrzejewski-Szmek 2cea199ec1 core: pass around pointer, not struct
Since this is a static function, the compiler is likely to optimize it away
anyway, but let's do the normal thing here.
2019-10-11 13:46:05 +02:00
Chris Down bc0623df16 cgroup: analyze: Report memory configurations that deviate from systemd
This is the most basic consumer of the new systemd-vs-kernel checker,
both acting as a reasonable standalone exerciser of the code, and also
as a way for easy inspection of deviations from systemd internal state.
2019-10-03 15:06:25 +01:00
Chris Down 6dfb92823f cgroup: analyze: Match standard dump format
We're the only ones left using = as the delimiter, which looks really
weird in `systemd-analyze dump`. Use `: ` like everyone else.
2019-10-03 15:06:25 +01:00
Chris Down 74b5fb272f cgroup: Allow checking systemd-internal limits against the kernel
We currently don't have any mitigations against another privileged user
on the system messing with the cgroup hierarchy, bringing the system out
of line with what we've set in systemd. We also don't have any real way
to surface this to the user (we do have logs, but you have to know to
look in the first place).

There are a few possible solutions:

1. Maintaining our own cgroup tree with the new fsopen API and having a
   read-only copy for everyone else. However, there are some
   complications on this front, and this may be infeasible in some
   environments. I'd rate this as a longer term effort that's tangential
   to this patch.
2. Actively checking for changes with {fa,i}notify and changing them
   back afterwards to match our configuration again. This is also
   possible, but it's also good to have a way to do passive monitoring
   of the situation without taking hard action. Also, currently daemons
   like senpai do actually need to modify the tree behind systemd's
   back (although hopefully this should be more integrated soon).

This patch implements another option, where one can, on demand, monitor
deviations in cgroup memory configuration from systemd's internal state.
Currently the only consumer is `systemd-analyze dump`, but the interface
is generic enough that it can also be exposed elsewhere later (for
example, over D-Bus).

Currently only memory limit style properties are supported, but later I
also plan to expand this out to other properties that systemd should
have ultimate control over.
2019-10-03 15:06:25 +01:00
Zbigniew Jędrzejewski-Szmek 86e94d95d0
Merge pull request #13246 from keszybz/add-SystemdOptions-efi-variable
Add efi variable to augment /proc/cmdline
2019-10-03 12:19:44 +02:00
Chris Down 64fe532e90 cgroup: Respect DefaultMemoryMin when setting memory.min
This is an oversight from https://github.com/systemd/systemd/pull/12332.

Sadly the tests didn't catch it since it requires a real cgroup
hierarchy to see, and it wasn't seen in prod since we're only currently
using DefaultMemoryLow, not DefaultMemoryMin. :-(
2019-09-30 18:41:21 +01:00
Chris Down 7c9d2b7993 cgroup: Check ancestor memory min for unified memory config
Otherwise we might not enable it when we should, ie. DefaultMemoryMin is
set in a parent, but not MemoryMin in the current unit.
2019-09-30 18:24:26 +01:00
Pavel Hrdina 047f5d63d7 cgroup: introduce support for cgroup v2 CPUSET controller
Introduce support for configuring cpus and mems for processes using
cgroup v2 CPUSET controller.  This allows users to limit which cpus
and memory NUMA nodes can be used by processes to better utilize
system resources.

The cgroup v2 interfaces to control it are cpuset.cpus and cpuset.mems
where the requested configuration is written.  However, it doesn't mean
that the requested configuration will be actually used as parent cgroup
may limit the cpus or mems as well.  In order to reflect the real
configuration cgroup v2 provides read-only files cpuset.cpus.effective
and cpuset.mems.effective which are exported to users as well.
2019-09-24 15:16:07 +02:00
Zbigniew Jędrzejewski-Szmek fdb3decaa7 util-lib: move some functions from basic/cgroup-util to shared/cgroup-setup
This way less stuff needs to be in basic. Initially, I wanted to move all the
parts of cgroup-utils.[ch] that depend on efivars.[ch] to shared, because
efivars.[ch] is in shared/. Later on, I decide to split efivars.[ch], so the
move done in this patch is not necessary anymore. Nevertheless, it is still
valid on its own. If at some point we want to expose libbasic, it is better to
to not have stuff that belong in libshared there.
2019-09-16 18:08:00 +02:00
Zbigniew Jędrzejewski-Szmek d4d99bc6e4 basic/cgroup-util: let cgroup_unified_flush() return the detected hierarchy
This avoid the use of the global variable.

Also rename cgroup_unified_update() to cgroup_unified_cached() and
cgroup_unified_flush() to cgroup_unified() to better reflect their new roles.
2019-09-16 18:06:20 +02:00
Kai Krakow 2dbc45aea7 cgroup: Also set io.bfq.weight
Current kernels with BFQ scheduler do not yet set their IO weight
through "io.weight" but through "io.bfq.weight" (using a slightly
different interface supporting only default weights, not per-device
weights). This commit enables "IOWeight=" to just to that.

This patch may be dropped at some time later.

Github-Link: https://github.com/systemd/systemd/issues/7057
Signed-off-by: Kai Krakow <kai@kaishome.de>
2019-08-20 11:50:59 +02:00
Zbigniew Jędrzejewski-Szmek a505166845
Merge pull request #13096 from keszybz/unit-loading
Preparatory work for the unit loading rework
2019-07-19 21:47:10 +02:00