Systemd

Commit Graph

Author	SHA1	Message	Date
Michal Sekletar	d910f4c2b2	core/cgroup: fix return value of unit_cgroup_freezer_action() We should return 0 only if the current freezer state, as reported by the kernel, is already the desired state. Otherwise, we would dispatch the return dbus message prematurely in bus_unit_method_freezer_generic(). Thanks to Frantisek Sumsal for reporting the issue.	2020-05-07 22:19:19 +02:00
Michal Sekletár	d9e45bc3ab	core: introduce support for cgroup freezer With cgroup v2 the cgroup freezer is implemented as a cgroup attribute called cgroup.freeze. cgroup can be frozen by writing "1" to the file and kernel will send us a notification through "cgroup.events" after the operation is finished and processes in the cgroup entered quiescent state, i.e. they are not scheduled to run. Writing "0" to the attribute file does the inverse and process execution is resumed. This commit exposes above low-level functionality through systemd's DBus API. Each unit type must provide specialized implementation for these methods, otherwise, we return an error. So far only service, scope, and slice unit types provide the support. It is possible to check if a given unit has the support using CanFreeze() DBus property. Note that DBus API has a synchronous behavior and we dispatch the reply to freeze/thaw requests only after the kernel has notified us that requested operation was completed.	2020-04-30 19:02:51 +02:00
Anita Zhang	613328c3e2	cgroup-util: helper to cg_get_attribute and convert to uint64_t A common pattern in the codebase is reading a cgroup memory value and converting it to a uint64_t. Let's make it a helper and refactor a few places to use it so it's more concise.	2020-03-24 16:05:16 -07:00
Benjamin Berg	b7cf4b4ef5	core: Fix resolution of nested DM devices for cgroups When using the cgroups IO controller, the device that is controlled should always be the toplevel block device. This did not get resolved correctly for an LVM volume inside a LUKS device, because the code would only resolve one level of indirection. Fix this by recursively looking up the originating block device for DM devices. Resolves: #15008	2020-03-06 16:11:44 +00:00
Lennart Poettering	c238a2f889	cgroup: minor comment improvement As pointed out here: https://github.com/systemd/systemd/pull/14564#discussion_r366305882	2020-01-14 16:57:51 +01:00
Lennart Poettering	48fd01e5f3	cgroup: drop redundant if check	2020-01-14 10:44:58 +01:00
Lennart Poettering	e1e98911a8	cgroup: update only siblings that got realized once Fixes: #14475 Replaces: #14554	2020-01-14 10:44:19 +01:00
Lennart Poettering	95ae4d1420	cgroup: drop unnecessary {}	2020-01-14 10:44:19 +01:00
Lennart Poettering	a0d6590c4e	cgroup: no need to cast dev_t to dev_t	2020-01-14 10:44:19 +01:00
Lennart Poettering	57f1030b13	cgroup: use log_warning_errno() where possible	2020-01-14 10:44:19 +01:00
Lennart Poettering	65f6b6bdcb	core: fix re-realization of cgroup siblings This is a fix-up for `eef85c4a3f` which broke this. Tracked down by @w-simon Fixes: #14453	2020-01-09 17:31:41 +01:00
Lennart Poettering	3288ea8f32	core: set "trusted.delegate" xattr on cgroups that are delegation boundaries Let's mark cgroups that are delegation boundaries to us. This can then be used by tools such as "systemd-cgls" to show where the next manager takes over.	2019-11-20 17:50:12 +01:00
Zbigniew Jędrzejewski-Szmek	3a0f06c41a	core: make TasksMax a partially dynamic property TasksMax= and DefaultTasksMax= can be specified as percentages. We don't actually document what the percentage is relative to, but the implementation uses the smallest of /proc/sys/kernel/pid_max, /proc/sys/kernel/threads-max, and /sys/fs/cgroup/pids.max (when present). When the value is a percentage, we immediately convert it to an absolute value. If the limit later changes (which can happen e.g. when systemd-sysctl runs), the absolute value becomes outdated. So let's store either the percentage or absolute value, whatever was specified, and only convert to an absolute value when the value is used. For example, when starting a unit, the absolute value will be calculated when the cgroup for the unit is created. Fixes #13419.	2019-11-14 18:41:54 +01:00
Zbigniew Jędrzejewski-Szmek	45669ae264	bpf: make sure the kernel do not submit an invalid program if no pattern matched It turns out that the kernel verifier would reject a program we would build if there was a whitelist, but no entries in the whitelist matched. The program would look approximately like this: 0: (61) r2 = *(u32 *)(r1 +0) 1: (54) w2 &= 65535 2: (61) r3 = *(u32 *)(r1 +0) 3: (74) w3 >>= 16 4: (61) r4 = *(u32 *)(r1 +4) 5: (61) r5 = *(u32 *)(r1 +8) 48: (b7) r0 = 0 49: (05) goto pc+1 50: (b7) r0 = 1 51: (95) exit and insn 50 is unreachable, which is illegal. We would then either keep a previous version of the program or allow everything. Make sure we build a valid program that simply rejects everything.	2019-11-11 15:14:09 +01:00
Zbigniew Jędrzejewski-Szmek	0848715cab	bpf: make bpf_devices_apply_policy() independent of any unit code	2019-11-11 14:55:57 +01:00
Zbigniew Jędrzejewski-Szmek	8b139557fe	core: split out one more function	2019-11-11 14:55:52 +01:00
Zbigniew Jędrzejewski-Szmek	a9aac7d8dd	core: also split out helper to handle static device nodes	2019-11-10 23:22:15 +01:00
Zbigniew Jędrzejewski-Szmek	124e05b3b6	core: move bpf devices implementation to bpf-devices.[ch] and rename The naming of the functions was a complete mess: the most specific functions which don't know anything about cgroups had "cgroup_" prefix, while more general functions which took a node path and a cgroup for reporting had no prefix. Let's use "bpf_devices_" for the latter group, and "bpf_prog_*" for the rest. The main goal of this move is to split the implementation from the calling code and add unit tests in a later patch.	2019-11-10 23:22:15 +01:00
Zbigniew Jędrzejewski-Szmek	084870f9c0	core: rename CGROUP_AUTO/STRICT/CLOSED to CGROUP_DEVICE_POLICY_… The old names were very generic, and when used without context it wasn't at all clear that they are about the devices policy.	2019-11-10 23:22:15 +01:00
Zbigniew Jędrzejewski-Szmek	672cbcbc20	bpf: return normally from whitelist_major() All callers do (void) anyway, so we can just use normal return here.	2019-11-10 23:22:15 +01:00
Zbigniew Jędrzejewski-Szmek	d49c180826	bpf: do not bother adding device patterns after whitelisting the full class This seems to have been unintentional.	2019-11-10 23:22:15 +01:00
Zbigniew Jędrzejewski-Szmek	fa6613fc53	bpf: refactor how we create device major:minor whitelists No functional change intended except for minor adjustments to error messages.	2019-11-10 23:22:15 +01:00
Lennart Poettering	c259ac9aa2	cpuset: fix indentation and log about OOM we otherwise ignore	2019-11-01 10:21:53 +01:00
Lennart Poettering	85c3b27891	cgroup: add some basic OOM safety where it was missing	2019-11-01 10:21:35 +01:00
Zbigniew Jędrzejewski-Szmek	2cea199ec1	core: pass around pointer, not struct Since this is a static function, the compiler is likely to optimize it away anyway, but let's do the normal thing here.	2019-10-11 13:46:05 +02:00
Chris Down	bc0623df16	cgroup: analyze: Report memory configurations that deviate from systemd This is the most basic consumer of the new systemd-vs-kernel checker, both acting as a reasonable standalone exerciser of the code, and also as a way for easy inspection of deviations from systemd internal state.	2019-10-03 15:06:25 +01:00
Chris Down	6dfb92823f	cgroup: analyze: Match standard dump format We're the only ones left using = as the delimiter, which looks really weird in `systemd-analyze dump`. Use `: ` like everyone else.	2019-10-03 15:06:25 +01:00
Chris Down	74b5fb272f	cgroup: Allow checking systemd-internal limits against the kernel We currently don't have any mitigations against another privileged user on the system messing with the cgroup hierarchy, bringing the system out of line with what we've set in systemd. We also don't have any real way to surface this to the user (we do have logs, but you have to know to look in the first place). There are a few possible solutions: 1. Maintaining our own cgroup tree with the new fsopen API and having a read-only copy for everyone else. However, there are some complications on this front, and this may be infeasible in some environments. I'd rate this as a longer term effort that's tangential to this patch. 2. Actively checking for changes with {fa,i}notify and changing them back afterwards to match our configuration again. This is also possible, but it's also good to have a way to do passive monitoring of the situation without taking hard action. Also, currently daemons like senpai do actually need to modify the tree behind systemd's back (although hopefully this should be more integrated soon). This patch implements another option, where one can, on demand, monitor deviations in cgroup memory configuration from systemd's internal state. Currently the only consumer is `systemd-analyze dump`, but the interface is generic enough that it can also be exposed elsewhere later (for example, over D-Bus). Currently only memory limit style properties are supported, but later I also plan to expand this out to other properties that systemd should have ultimate control over.	2019-10-03 15:06:25 +01:00
Zbigniew Jędrzejewski-Szmek	86e94d95d0	Merge pull request #13246 from keszybz/add-SystemdOptions-efi-variable Add efi variable to augment /proc/cmdline	2019-10-03 12:19:44 +02:00
Chris Down	64fe532e90	cgroup: Respect DefaultMemoryMin when setting memory.min This is an oversight from https://github.com/systemd/systemd/pull/12332. Sadly the tests didn't catch it since it requires a real cgroup hierarchy to see, and it wasn't seen in prod since we're only currently using DefaultMemoryLow, not DefaultMemoryMin. :-(	2019-09-30 18:41:21 +01:00
Chris Down	7c9d2b7993	cgroup: Check ancestor memory min for unified memory config Otherwise we might not enable it when we should, ie. DefaultMemoryMin is set in a parent, but not MemoryMin in the current unit.	2019-09-30 18:24:26 +01:00
Pavel Hrdina	047f5d63d7	cgroup: introduce support for cgroup v2 CPUSET controller Introduce support for configuring cpus and mems for processes using cgroup v2 CPUSET controller. This allows users to limit which cpus and memory NUMA nodes can be used by processes to better utilize system resources. The cgroup v2 interfaces to control it are cpuset.cpus and cpuset.mems where the requested configuration is written. However, it doesn't mean that the requested configuration will be actually used as parent cgroup may limit the cpus or mems as well. In order to reflect the real configuration cgroup v2 provides read-only files cpuset.cpus.effective and cpuset.mems.effective which are exported to users as well.	2019-09-24 15:16:07 +02:00
Zbigniew Jędrzejewski-Szmek	fdb3decaa7	util-lib: move some functions from basic/cgroup-util to shared/cgroup-setup This way less stuff needs to be in basic. Initially, I wanted to move all the parts of cgroup-util.[ch] that depend on efivars.[ch] to shared, because efivars.[ch] is in shared/. Later on, I decided to split efivars.[ch], so the move done in this patch is not necessary anymore. Nevertheless, it is still valid on its own. If at some point we want to expose libbasic, it is better to not have stuff that belongs in libshared there.	2019-09-16 18:08:00 +02:00
Zbigniew Jędrzejewski-Szmek	d4d99bc6e4	basic/cgroup-util: let cgroup_unified_flush() return the detected hierarchy This avoids the use of the global variable. Also rename cgroup_unified_update() to cgroup_unified_cached() and cgroup_unified_flush() to cgroup_unified() to better reflect their new roles.	2019-09-16 18:06:20 +02:00
Kai Krakow	2dbc45aea7	cgroup: Also set io.bfq.weight Current kernels with BFQ scheduler do not yet set their IO weight through "io.weight" but through "io.bfq.weight" (using a slightly different interface supporting only default weights, not per-device weights). This commit enables "IOWeight=" to do just that. This patch may be dropped at some time later. Github-Link: https://github.com/systemd/systemd/issues/7057 Signed-off-by: Kai Krakow <kai@kaishome.de>	2019-08-20 11:50:59 +02:00
Zbigniew Jędrzejewski-Szmek	a505166845	Merge pull request #13096 from keszybz/unit-loading Preparatory work for the unit loading rework	2019-07-19 21:47:10 +02:00
Zbigniew Jędrzejewski-Szmek	f4c43a8115	pid1: do not say "(null)" if no disabled controllers It looks like we made a mistake. The list is just empty, that's all.	2019-07-19 16:51:14 +02:00
Zbigniew Jędrzejewski-Szmek	217b7b33cc	pid1: order jobs that execute processes with lower priority We can meaningfully compare jobs for units which have cpu weight or nice set. But non-exec units don't have those set. Starting non-exec jobs first allows us to get them out of the queue quickly, and consider more jobs for starting. If we have service A, and socket B, and service C which is after socket B, and we want to start both A and C, and C has higher cpu weight, if we get B out of the way first, we'll know that we can start both A and C, and we'll start C first. Also invert the comparisons using CMP() so they are always done left vs. right, and negate when returning instead. Follow-up for `da8e178296`.	2019-07-19 14:38:52 +09:00
Michael Olbrich	da8e178296	job: make the run queue order deterministic Jobs are added to the run queue in random order. This happens because most jobs are added by iterating over the transaction or dependency hash maps. As a result, jobs that can be executed at the same time are started in a different order each time. On small embedded devices this can cause a measurable jitter for the point in time when a job starts (~100ms jitter for 10 units that are started in random order). This results in a similar jitter for the boot time. This is undesirable in general and makes optimizing the boot time a lot harder. Also, jobs that should have a higher priority because the unit has a higher CPU weight might get executed later than others. Fix this by turning the job run_queue into a Prioq and sort by the following criteria (use the next if the values are equal): - CPU weight - nice level - unit type - unit name The last one is just there for deterministic sorting to avoid any jitter.	2019-07-18 10:28:39 +02:00
Zbigniew Jędrzejewski-Szmek	95b21cff0e	Apply empty_to_root() in three more spots for safety	2019-07-15 18:39:26 +02:00
Kai Lüke	fab347489f	bpf-firewall: custom BPF programs through IP(Ingress|Egress)FilterPath= Takes a single /sys/fs/bpf/pinned_prog string as argument, but may be specified multiple times. An empty assignment resets all previous filters. Closes https://github.com/systemd/systemd/issues/10227	2019-06-25 09:56:16 +02:00
Yu Watanabe	270384b2d4	tree-wide: replace strjoina() with prefix_roota()	2019-06-25 01:31:26 +09:00
Lennart Poettering	cee97d5768	Merge pull request #12836 from yuwata/tree-wide-replace-strjoin tree-wide: replace strjoin() with path_join()	2019-06-22 20:02:46 +02:00
Yu Watanabe	657ee2d82b	tree-wide: replace strjoin() with path_join()	2019-06-21 03:26:16 +09:00
Donald Buczek	0219b3524f	cgroup: Continue unit reset if cgroup is busy When part of the cgroup hierarchy cannot be deleted (e.g. because there are still processes in it), do not exit unit_prune_cgroup early, but continue so that u->cgroup_realized is reset. Log the known case of non-empty cgroups at debug level and other errors at warning level. Fixes https://github.com/systemd/systemd/issues/12386	2019-06-20 10:16:53 +02:00
Chris Down	c710d3b430	cgroup: Prevent theoretical nullptr deref in unit mask calculation	2019-06-07 06:33:53 +01:00
Zbigniew Jędrzejewski-Szmek	84d2744bc5	Move warning about unsupported BPF firewall right before the firewall would be created There's no need to warn about the firewall when parsing, because the unit might not be started at all. Let's warn only when we're actually preparing to start the firewall. This changes behaviour: - the warning is printed just once for all unit types, and not once for normal units and once for transient units. - on repeat warnings, the message is not printed at all. There's already detailed debug info from bpf_firewall_compile(), so we don't need to repeat ourselves. - when we are not root, let's say precisely that, not "lack of necessary privileges" and "the local system does not support BPF/cgroup firewalling". Fixes #12673.	2019-06-04 17:22:37 +02:00
Zbigniew Jędrzejewski-Szmek	2ba6ae6b2b	core: do an extra check if oom was triggered when handling sigchild Should fix #12425.	2019-05-20 16:37:06 +02:00
Ben Boeckel	5238e95759	codespell: fix spelling errors	2019-04-29 16:47:18 +02:00
Lennart Poettering	d8974757c4	Merge pull request #12407 from keszybz/two-unrelated-cleanups Two unrelated cleanups	2019-04-26 23:43:27 +02:00
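
The run-queue ordering described in commit da8e178296 (CPU weight, then nice level, then unit type, then unit name) can be illustrated with a small priority queue. This is only a sketch in Python, not systemd's C implementation: the Job and RunQueue names, the default weight of 100, the direction of each comparison (higher CPU weight first, lower nice first) and the plain string comparison of unit types are all assumptions made for illustration. The later adjustment from commit 217b7b33cc is not modelled.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical stand-in for a job; fields mirror the four sort criteria."""
    unit_name: str
    unit_type: str
    cpu_weight: int = 100  # assumed default, for illustration only
    nice: int = 0


def sort_key(job):
    # Assumed directions: higher CPU weight first, lower nice level first.
    # Unit type and unit name are compared as plain strings; the name is
    # only a tie-breaker that makes the order deterministic.
    return (-job.cpu_weight, job.nice, job.unit_type, job.unit_name)


class RunQueue:
    """Priority queue whose pop order does not depend on insertion order."""

    def __init__(self):
        self._heap = []

    def push(self, job):
        heapq.heappush(self._heap, (sort_key(job), job))

    def pop(self):
        return heapq.heappop(self._heap)[1]

    def __len__(self):
        return len(self._heap)


def start_order(jobs):
    """Return unit names in the order the queue would hand them out."""
    queue = RunQueue()
    for job in jobs:
        queue.push(job)
    return [queue.pop().unit_name for _ in range(len(queue))]


if __name__ == "__main__":
    jobs = [
        Job("a.service", "service"),
        Job("b.socket", "socket"),
        Job("c.service", "service", cpu_weight=500),
        Job("d.service", "service", nice=-5),
    ]
    print(start_order(jobs))
    print(start_order(list(reversed(jobs))))
```

Both calls print the same list, which is the point of the change: whatever order the jobs are inserted in, they come out of the queue in the same order.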


340 Commits