Systemd

Author	SHA1	Message	Date
Zbigniew Jędrzejewski-Szmek	90e74a66e6	tree-wide: define iterator inside of the macro	2020-09-08 12:14:05 +02:00
Michal Koutný	d9ef594454	cgroup: Cleanup function usage Some masks shouldn't be needed externally, so keep their functions in the module (others would fit there too but they're used in tests) to think twice if something would depend on them. Drop unused function cg_attach_many_everywhere. Use cgroup_realized instead of cgroup_path when we actually ask for realized. This should not cause any functional changes.	2020-08-19 11:41:53 +02:00
Michal Koutný	12b975e065	cgroup: Reduce unit_get_ancestor_disable_mask use The usage in unit_get_own_mask is redundant, we only need to apply disable_mask at the end before application, i.e. calculating enable or target mask. (IOW, we allow all configurations, but disabling affects effective controls.) Modify tests accordingly and add testing of enable mask. This is intended as cleanup, with no effect but changing unit_dump output.	2020-08-19 11:41:53 +02:00
Michal Koutný	4c591f3996	cgroup: Introduce family queueing instead of siblings The unit_add_siblings_to_cgroup_realize_queue does more than mere siblings queueing, hence define a family of a unit as (immediate) children of the unit and immediate children of all ancestors. Working with this abstraction simplifies the queuing calls and it shouldn't change the functionality.	2020-08-19 11:41:53 +02:00
Michal Koutný	f23ba94db3	cgroup: Implicit unit_invalidate_cgroup_members_masks Merge members mask invalidation into unit_add_siblings_to_cgroup_realize_queue, this way unit_realize_cgroup needn't be called with members mask invalidation. We have to retain the members mask invalidation in unit_load -- although active units would have cgroups (re)realized (unit_load queues for realization), the realization would happen with potentially stale mask.	2020-08-19 11:41:53 +02:00
Michal Koutný	fb46fca7e0	cgroup: Eager realization in unit_free unit_free(u) realizes direct parent and invalidates members mask of all ancestors. This isn't sufficient in v1 controller hierarchies since siblings of the freed unit may have existed only because of the removed unit. We cannot be lazy about the siblings because if parent(u) is also removed, it'd migrate and rmdir cgroups for siblings(u). However, realized masks of siblings(u) won't reflect this change. This was a non-issue earlier, because we weren't removing cgroup directories properly (effectively matching the stale realized mask), removal failed because of tasks left by missing migration (see previous commit). Therefore, ensure realization of all units necessary to clean up after the free'd unit. Fixes: #14149	2020-08-19 11:41:53 +02:00
Michal Koutný	7b63961415	cgroup: Swap cgroup v1 deletion and migration When we are about to derealize a controller on v1 cgroup, we first attempt to delete the controller cgroup and migrate afterwards. This doesn't work in practice because populated cgroup cannot be deleted. Furthermore, we leave out slices from migration completely, so (un)setting a control value on them won't realize their controller cgroup. Rework actual realization, unit_create_cgroup() becomes unit_update_cgroup() and make sure that controller hierarchies are reduced when given controller cgroup ceased to be needed. Note that with this we introduce slight deviation between v1 and v2 code -- when a descendant unit turns off a delegated controller, we attempt to disable it in ancestor slices. On v2 this may fail (kernel enforced, because of child cgroups using the controller), on v1 we'll migrate whole subtree and trim the subhierarchy. (Previously, we wouldn't take away delegated controller, however, derealization was broken anyway.) Fixes: #14149	2020-08-19 11:41:53 +02:00
Michal Koutný	30ad3ca086	cgroup: Add root slice to cgroup realization queue When we're disabling controller on a direct child of root cgroup, we forgot to add root slice into cgroup realization queue, which prevented proper disabling of the controller (on unified hierarchy). The mechanism relying on "bounce from bottom and propagate up" in unit_create_cgroup doesn't work on unified hierarchy (leaves needn't be enabled). Drop it as we rely on the ancestors to be queued -- that's now intentional but was an artifact of combining the two patches: `cb5e3bc37d` ("cgroup: Don't explicitly check for member in UNIT_BEFORE") v240~78 `65f6b6bdcb` ("core: fix re-realization of cgroup siblings") v245-rc1~153^2 Fixes: #14917	2020-07-28 15:49:24 +02:00
Michal Koutný	a479c21ed2	cgroup: Make realize_queue behave FIFO The current implementation is LIFO, which is a) confusing b) prevents some ordered operations on the cgroup tree (e.g. removing children before parents). Fix it quickly. Current list implementation turns this from O(1) to O(n) operation. Rework the lists later.	2020-07-28 15:49:24 +02:00
Lennart Poettering	6b000af4f2	tree-wide: avoid some loaded terms https://tools.ietf.org/html/draft-knodel-terminology-02 https://lwn.net/Articles/823224/ This gets rid of most but not all occasions of these loaded terms: 1. scsi_id and friends are something that is supposed to be removed from our tree (see #7594) 2. The test suite defines an API used by the ubuntu CI. We can remove this too later, but this needs to be done in sync with the ubuntu CI. 3. In some cases the terms are part of APIs we call or where we expose concepts the kernel names the way it names them. (In particular all remaining uses of the word "slave" in our codebase are like this, it's used by the POSIX PTY layer, by the network subsystem, the mount API and the block device subsystem). Getting rid of the term in these contexts would mean doing some major fixes of the kernel ABI first. Regarding the replacements: when whitelist/blacklist is used as noun we replace it with allow list/deny list, and when used as verb with allow-list/deny-list.	2020-06-25 09:00:19 +02:00
Michal Sekletar	d910f4c2b2	core/cgroup: fix return value of unit_cgroup_freezer_action() We should return 0 only if current freezer state, as reported by the kernel, is already the desired state. Otherwise, we would dispatch return dbus message prematurely in bus_unit_method_freezer_generic(). Thanks to Frantisek Sumsal for reporting the issue.	2020-05-07 22:19:19 +02:00
Michal Sekletár	d9e45bc3ab	core: introduce support for cgroup freezer With cgroup v2 the cgroup freezer is implemented as a cgroup attribute called cgroup.freeze. cgroup can be frozen by writing "1" to the file and kernel will send us a notification through "cgroup.events" after the operation is finished and processes in the cgroup entered quiescent state, i.e. they are not scheduled to run. Writing "0" to the attribute file does the inverse and process execution is resumed. This commit exposes above low-level functionality through systemd's DBus API. Each unit type must provide specialized implementation for these methods, otherwise, we return an error. So far only service, scope, and slice unit types provide the support. It is possible to check if a given unit has the support using CanFreeze() DBus property. Note that DBus API has a synchronous behavior and we dispatch the reply to freeze/thaw requests only after the kernel has notified us that requested operation was completed.	2020-04-30 19:02:51 +02:00
Anita Zhang	613328c3e2	cgroup-util: helper to cg_get_attribute and convert to uint64_t A common pattern in the codebase is reading a cgroup memory value and converting it to a uint64_t. Let's make it a helper and refactor a few places to use it so it's more concise.	2020-03-24 16:05:16 -07:00
Benjamin Berg	b7cf4b4ef5	core: Fix resolution of nested DM devices for cgroups When using the cgroups IO controller, the device that is controlled should always be the toplevel block device. This did not get resolved correctly for an LVM volume inside a LUKS device, because the code would only resolve one level of indirection. Fix this by recursively looking up the originating block device for DM devices. Resolves: #15008	2020-03-06 16:11:44 +00:00
Lennart Poettering	c238a2f889	cgroup: minor comment improvement As pointed out here: https://github.com/systemd/systemd/pull/14564#discussion_r366305882	2020-01-14 16:57:51 +01:00
Lennart Poettering	48fd01e5f3	cgroup: drop redundant if check	2020-01-14 10:44:58 +01:00
Lennart Poettering	e1e98911a8	cgroup: update only siblings that got realized once Fixes: #14475 Replaces: #14554	2020-01-14 10:44:19 +01:00
Lennart Poettering	95ae4d1420	cgroup: drop unnecessary {}	2020-01-14 10:44:19 +01:00
Lennart Poettering	a0d6590c4e	cgroup: no need to cast dev_t to dev_t	2020-01-14 10:44:19 +01:00
Lennart Poettering	57f1030b13	cgroup: use log_warning_errno() where possible	2020-01-14 10:44:19 +01:00
Lennart Poettering	65f6b6bdcb	core: fix re-realization of cgroup siblings This is a fix-up for `eef85c4a3f` which broke this. Tracked down by @w-simon Fixes: #14453	2020-01-09 17:31:41 +01:00
Lennart Poettering	3288ea8f32	core: set "trusted.delegate" xattr on cgroups that are delegation boundaries Let's mark cgroups that are delegation boundaries to us. This can then be used by tools such as "systemd-cgls" to show where the next manager takes over.	2019-11-20 17:50:12 +01:00
Zbigniew Jędrzejewski-Szmek	3a0f06c41a	core: make TasksMax a partially dynamic property TasksMax= and DefaultTasksMax= can be specified as percentages. We don't actually document what the percentage is relative to, but the implementation uses the smallest of /proc/sys/kernel/pid_max, /proc/sys/kernel/threads-max, and /sys/fs/cgroup/pids.max (when present). When the value is a percentage, we immediately convert it to an absolute value. If the limit later changes (which can happen e.g. when systemd-sysctl runs), the absolute value becomes outdated. So let's store either the percentage or absolute value, whatever was specified, and only convert to an absolute value when the value is used. For example, when starting a unit, the absolute value will be calculated when the cgroup for the unit is created. Fixes #13419.	2019-11-14 18:41:54 +01:00
Zbigniew Jędrzejewski-Szmek	45669ae264	bpf: make sure we do not submit an invalid program to the kernel if no pattern matched It turns out that the kernel verifier would reject a program we would build if there was a whitelist, but no entries in the whitelist matched. The program would look approximately like this: 0: (61) r2 = *(u32 *)(r1 +0) 1: (54) w2 &= 65535 2: (61) r3 = *(u32 *)(r1 +0) 3: (74) w3 >>= 16 4: (61) r4 = *(u32 *)(r1 +4) 5: (61) r5 = *(u32 *)(r1 +8) 48: (b7) r0 = 0 49: (05) goto pc+1 50: (b7) r0 = 1 51: (95) exit and insn 50 is unreachable, which is illegal. We would then either keep a previous version of the program or allow everything. Make sure we build a valid program that simply rejects everything.	2019-11-11 15:14:09 +01:00
Zbigniew Jędrzejewski-Szmek	0848715cab	bpf: make bpf_devices_apply_policy() independent of any unit code	2019-11-11 14:55:57 +01:00
Zbigniew Jędrzejewski-Szmek	8b139557fe	core: split out one more function	2019-11-11 14:55:52 +01:00
Zbigniew Jędrzejewski-Szmek	a9aac7d8dd	core: also split out helper to handle static device nodes	2019-11-10 23:22:15 +01:00
Zbigniew Jędrzejewski-Szmek	124e05b3b6	core: move bpf devices implementation to bpf-devices.[ch] and rename The naming of the functions was a complete mess: the most specific functions which don't know anything about cgroups had "cgroup_" prefix, while more general functions which took a node path and a cgroup for reporting had no prefix. Let's use "bpf_devices_" for the latter group, and "bpf_prog_*" for the rest. The main goal of this move is to split the implementation from the calling code and add unit tests in a later patch.	2019-11-10 23:22:15 +01:00
Zbigniew Jędrzejewski-Szmek	084870f9c0	core: rename CGROUP_AUTO/STRICT/CLOSED to CGROUP_DEVICE_POLICY_… The old names were very generic, and when used without context it wasn't at all clear that they are about the devices policy.	2019-11-10 23:22:15 +01:00
Zbigniew Jędrzejewski-Szmek	672cbcbc20	bpf: return normally from whitelist_major() All callers do (void) anyway, so we can just use normal return here.	2019-11-10 23:22:15 +01:00
Zbigniew Jędrzejewski-Szmek	d49c180826	bpf: do not bother adding device patterns after whitelisting the full class This seems to have been unintentional.	2019-11-10 23:22:15 +01:00
Zbigniew Jędrzejewski-Szmek	fa6613fc53	bpf: refactor how we create device major:minor whitelists No functional change intended except for minor adjustments to error messages.	2019-11-10 23:22:15 +01:00
Lennart Poettering	c259ac9aa2	cpuset: fix indentation and log about OOM we otherwise ignore	2019-11-01 10:21:53 +01:00
Lennart Poettering	85c3b27891	cgroup: add some basic OOM safety where it was missing	2019-11-01 10:21:35 +01:00
Zbigniew Jędrzejewski-Szmek	2cea199ec1	core: pass around pointer, not struct Since this is a static function, the compiler is likely to optimize it away anyway, but let's do the normal thing here.	2019-10-11 13:46:05 +02:00
Chris Down	bc0623df16	cgroup: analyze: Report memory configurations that deviate from systemd This is the most basic consumer of the new systemd-vs-kernel checker, both acting as a reasonable standalone exerciser of the code, and also as a way for easy inspection of deviations from systemd internal state.	2019-10-03 15:06:25 +01:00
Chris Down	6dfb92823f	cgroup: analyze: Match standard dump format We're the only ones left using = as the delimiter, which looks really weird in `systemd-analyze dump`. Use `: ` like everyone else.	2019-10-03 15:06:25 +01:00
Chris Down	74b5fb272f	cgroup: Allow checking systemd-internal limits against the kernel We currently don't have any mitigations against another privileged user on the system messing with the cgroup hierarchy, bringing the system out of line with what we've set in systemd. We also don't have any real way to surface this to the user (we do have logs, but you have to know to look in the first place). There are a few possible solutions: 1. Maintaining our own cgroup tree with the new fsopen API and having a read-only copy for everyone else. However, there are some complications on this front, and this may be infeasible in some environments. I'd rate this as a longer term effort that's tangential to this patch. 2. Actively checking for changes with {fa,i}notify and changing them back afterwards to match our configuration again. This is also possible, but it's also good to have a way to do passive monitoring of the situation without taking hard action. Also, currently daemons like senpai do actually need to modify the tree behind systemd's back (although hopefully this should be more integrated soon). This patch implements another option, where one can, on demand, monitor deviations in cgroup memory configuration from systemd's internal state. Currently the only consumer is `systemd-analyze dump`, but the interface is generic enough that it can also be exposed elsewhere later (for example, over D-Bus). Currently only memory limit style properties are supported, but later I also plan to expand this out to other properties that systemd should have ultimate control over.	2019-10-03 15:06:25 +01:00
Zbigniew Jędrzejewski-Szmek	86e94d95d0	Merge pull request #13246 from keszybz/add-SystemdOptions-efi-variable Add efi variable to augment /proc/cmdline	2019-10-03 12:19:44 +02:00
Chris Down	64fe532e90	cgroup: Respect DefaultMemoryMin when setting memory.min This is an oversight from https://github.com/systemd/systemd/pull/12332. Sadly the tests didn't catch it since it requires a real cgroup hierarchy to see, and it wasn't seen in prod since we're only currently using DefaultMemoryLow, not DefaultMemoryMin. :-(	2019-09-30 18:41:21 +01:00
Chris Down	7c9d2b7993	cgroup: Check ancestor memory min for unified memory config Otherwise we might not enable it when we should, ie. DefaultMemoryMin is set in a parent, but not MemoryMin in the current unit.	2019-09-30 18:24:26 +01:00
Pavel Hrdina	047f5d63d7	cgroup: introduce support for cgroup v2 CPUSET controller Introduce support for configuring cpus and mems for processes using cgroup v2 CPUSET controller. This allows users to limit which cpus and memory NUMA nodes can be used by processes to better utilize system resources. The cgroup v2 interfaces to control it are cpuset.cpus and cpuset.mems where the requested configuration is written. However, it doesn't mean that the requested configuration will be actually used as parent cgroup may limit the cpus or mems as well. In order to reflect the real configuration cgroup v2 provides read-only files cpuset.cpus.effective and cpuset.mems.effective which are exported to users as well.	2019-09-24 15:16:07 +02:00
Zbigniew Jędrzejewski-Szmek	fdb3decaa7	util-lib: move some functions from basic/cgroup-util to shared/cgroup-setup This way less stuff needs to be in basic. Initially, I wanted to move all the parts of cgroup-utils.[ch] that depend on efivars.[ch] to shared, because efivars.[ch] is in shared/. Later on, I decided to split efivars.[ch], so the move done in this patch is not necessary anymore. Nevertheless, it is still valid on its own. If at some point we want to expose libbasic, it is better to not have stuff that belongs in libshared there.	2019-09-16 18:08:00 +02:00
Zbigniew Jędrzejewski-Szmek	d4d99bc6e4	basic/cgroup-util: let cgroup_unified_flush() return the detected hierarchy This avoids the use of the global variable. Also rename cgroup_unified_update() to cgroup_unified_cached() and cgroup_unified_flush() to cgroup_unified() to better reflect their new roles.	2019-09-16 18:06:20 +02:00
Kai Krakow	2dbc45aea7	cgroup: Also set io.bfq.weight Current kernels with BFQ scheduler do not yet set their IO weight through "io.weight" but through "io.bfq.weight" (using a slightly different interface supporting only default weights, not per-device weights). This commit enables "IOWeight=" to do just that. This patch may be dropped at some time later. Github-Link: https://github.com/systemd/systemd/issues/7057 Signed-off-by: Kai Krakow <kai@kaishome.de>	2019-08-20 11:50:59 +02:00
Zbigniew Jędrzejewski-Szmek	a505166845	Merge pull request #13096 from keszybz/unit-loading Preparatory work for the unit loading rework	2019-07-19 21:47:10 +02:00
Zbigniew Jędrzejewski-Szmek	f4c43a8115	pid1: do not say "(null)" if no disabled controllers It looks like we made a mistake. The list is just empty, that's all.	2019-07-19 16:51:14 +02:00
Zbigniew Jędrzejewski-Szmek	217b7b33cc	pid1: order jobs that execute processes with lower priority We can meaningfully compare jobs for units which have cpu weight or nice set. But non-exec units do not have those set. Starting non-exec jobs first allows us to get them out of the queue quickly, and consider more jobs for starting. If we have service A, and socket B, and service C which is after socket B, and we want to start both A and C, and C has higher cpu weight, if we get B out of the way first, we'll know that we can start both A and C, and we'll start C first. Also invert the comparisons using CMP() so they are always done left vs. right, and negate when returning instead. Follow-up for `da8e178296`.	2019-07-19 14:38:52 +09:00
Michael Olbrich	da8e178296	job: make the run queue order deterministic Jobs are added to the run queue in random order. This happens because most jobs are added by iterating over the transaction or dependency hash maps. As a result, jobs that can be executed at the same time are started in a different order each time. On small embedded devices this can cause a measurable jitter for the point in time when a job starts (~100ms jitter for 10 units that are started in random order). This results in a similar jitter for the boot time. This is undesirable in general and makes optimizing the boot time a lot harder. Also, jobs that should have a higher priority because the unit has a higher CPU weight might get executed later than others. Fix this by turning the job run_queue into a Prioq and sort by the following criteria (use the next if the values are equal): - CPU weight - nice level - unit type - unit name The last one is just there for deterministic sorting to avoid any jitter.	2019-07-18 10:28:39 +02:00
Zbigniew Jędrzejewski-Szmek	95b21cff0e	Apply empty_to_root() in three more spots for safety	2019-07-15 18:39:26 +02:00
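
The run-queue ordering described in da8e178296 (CPU weight, then nice level, then unit type, then unit name) can be illustrated with a small comparator. This is only a sketch under stated assumptions, not the code from that commit: `struct job_key`, `job_key_compare()` and the local `CMP()` macro are hypothetical names, the real queue is a Prioq rather than a sorted array, and the direction chosen for nice level and unit type (lower value first) is an assumption. Only "higher CPU weight runs earlier" is taken from the commit message.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a queued job; not systemd's Job struct. */
struct job_key {
    unsigned long cpu_weight;  /* higher weight should start earlier */
    int nice;                  /* assumption: lower nice value starts earlier */
    int unit_type;             /* assumption: plain ascending order */
    const char *unit_name;     /* final tie-breaker, keeps the order deterministic */
};

/* Three-way comparison: yields -1, 0 or 1. Local helper for this sketch. */
#define CMP(a, b) (((a) > (b)) - ((a) < (b)))

/* qsort()-style comparator: a negative result means "a runs before b". */
static int job_key_compare(const void *pa, const void *pb) {
    const struct job_key *a = pa, *b = pb;
    int r;

    /* Compare left vs. right, then negate so the larger weight sorts first. */
    r = -CMP(a->cpu_weight, b->cpu_weight);
    if (r != 0)
        return r;

    r = CMP(a->nice, b->nice);
    if (r != 0)
        return r;

    r = CMP(a->unit_type, b->unit_type);
    if (r != 0)
        return r;

    /* Equal on every other criterion: fall back to the name. */
    return strcmp(a->unit_name, b->unit_name);
}
```

Because the unit name is always consulted last, two jobs compare equal only if they share a name, so sorting the same set of jobs yields the same order on every run.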


350 commits