Systemd

Author	SHA1	Message	Date
Pavel Hrdina	047f5d63d7	cgroup: introduce support for cgroup v2 CPUSET controller Introduce support for configuring cpus and mems for processes using cgroup v2 CPUSET controller. This allows users to limit which cpus and memory NUMA nodes can be used by processes to better utilize system resources. The cgroup v2 interfaces to control it are cpuset.cpus and cpuset.mems where the requested configuration is written. However, it doesn't mean that the requested configuration will actually be used, as a parent cgroup may limit the cpus or mems as well. In order to reflect the real configuration cgroup v2 provides read-only files cpuset.cpus.effective and cpuset.mems.effective which are exported to users as well.	2019-09-24 15:16:07 +02:00
Kai Krakow	2dbc45aea7	cgroup: Also set io.bfq.weight Current kernels with BFQ scheduler do not yet set their IO weight through "io.weight" but through "io.bfq.weight" (using a slightly different interface supporting only default weights, not per-device weights). This commit enables "IOWeight=" to do just that. This patch may be dropped at some time later. Github-Link: https://github.com/systemd/systemd/issues/7057 Signed-off-by: Kai Krakow <kai@kaishome.de>	2019-08-20 11:50:59 +02:00
Zbigniew Jędrzejewski-Szmek	a505166845	Merge pull request #13096 from keszybz/unit-loading Preparatory work for the unit loading rework	2019-07-19 21:47:10 +02:00
Zbigniew Jędrzejewski-Szmek	f4c43a8115	pid1: do not say "(null)" if no disabled controllers It looks like we made a mistake. The list is just empty, that's all.	2019-07-19 16:51:14 +02:00
Zbigniew Jędrzejewski-Szmek	217b7b33cc	pid1: order jobs that execute processes with lower priority We can meaningfully compare jobs for units which have cpu weight or nice set. But non-exec units do not have those set. Starting non-exec jobs first allows us to get them out of the queue quickly, and consider more jobs for starting. If we have service A, and socket B, and service C which is after socket B, and we want to start both A and C, and C has higher cpu weight, if we get B out of the way first, we'll know that we can start both A and C, and we'll start C first. Also invert the comparisons using CMP() so they are always done left vs. right, and negate when returning instead. Follow-up for `da8e178296`.	2019-07-19 14:38:52 +09:00
Michael Olbrich	da8e178296	job: make the run queue order deterministic Jobs are added to the run queue in random order. This happens because most jobs are added by iterating over the transaction or dependency hash maps. As a result, jobs that can be executed at the same time are started in a different order each time. On small embedded devices this can cause a measurable jitter for the point in time when a job starts (~100ms jitter for 10 units that are started in random order). This results in a similar jitter for the boot time. This is undesirable in general and makes optimizing the boot time a lot harder. Also, jobs that should have a higher priority because the unit has a higher CPU weight might get executed later than others. Fix this by turning the job run_queue into a Prioq and sort by the following criteria (use the next if the values are equal): - CPU weight - nice level - unit type - unit name The last one is just there for deterministic sorting to avoid any jitter.	2019-07-18 10:28:39 +02:00
Zbigniew Jędrzejewski-Szmek	95b21cff0e	Apply empty_to_root() in three more spots for safety	2019-07-15 18:39:26 +02:00
Kai Lüke	fab347489f	bpf-firewall: custom BPF programs through IP(Ingress|Egress)FilterPath= Takes a single /sys/fs/bpf/pinned_prog string as argument, but may be specified multiple times. An empty assignment resets all previous filters. Closes https://github.com/systemd/systemd/issues/10227	2019-06-25 09:56:16 +02:00
Yu Watanabe	270384b2d4	tree-wide: replace strjoina() with prefix_roota()	2019-06-25 01:31:26 +09:00
Lennart Poettering	cee97d5768	Merge pull request #12836 from yuwata/tree-wide-replace-strjoin tree-wide: replace strjoin() with path_join()	2019-06-22 20:02:46 +02:00
Yu Watanabe	657ee2d82b	tree-wide: replace strjoin() with path_join()	2019-06-21 03:26:16 +09:00
Donald Buczek	0219b3524f	cgroup: Continue unit reset if cgroup is busy When part of the cgroup hierarchy cannot be deleted (e.g. because there are still processes in it), do not exit unit_prune_cgroup early, but continue so that u->cgroup_realized is reset. Log the known case of non-empty cgroups at debug level and other errors at warning level. Fixes https://github.com/systemd/systemd/issues/12386	2019-06-20 10:16:53 +02:00
Chris Down	c710d3b430	cgroup: Prevent theoretical nullptr deref in unit mask calculation	2019-06-07 06:33:53 +01:00
Zbigniew Jędrzejewski-Szmek	84d2744bc5	Move warning about unsupported BPF firewall right before the firewall would be created There's no need to warn about the firewall when parsing, because the unit might not be started at all. Let's warn only when we're actually preparing to start the firewall. This changes behaviour: - the warning is printed just once for all unit types, and not once for normal units and once for transient units. - on repeat warnings, the message is not printed at all. There's already detailed debug info from bpf_firewall_compile(), so we don't need to repeat ourselves. - when we are not root, let's say precisely that, not "lack of necessary privileges" and "the local system does not support BPF/cgroup firewalling". Fixes #12673.	2019-06-04 17:22:37 +02:00
Zbigniew Jędrzejewski-Szmek	2ba6ae6b2b	core: do an extra check if oom was triggered when handling sigchild Should fix #12425.	2019-05-20 16:37:06 +02:00
Ben Boeckel	5238e95759	codespell: fix spelling errors	2019-04-29 16:47:18 +02:00
Lennart Poettering	d8974757c4	Merge pull request #12407 from keszybz/two-unrelated-cleanups Two unrelated cleanups	2019-04-26 23:43:27 +02:00
Zbigniew Jędrzejewski-Szmek	c5b7ae0edb	Merge pull request #12074 from poettering/io-acct expose IO stats on the bus and in "systemctl status" and "systemd-run --wait"	2019-04-25 11:59:37 +02:00
Zbigniew Jędrzejewski-Szmek	c5322608a5	core: adjust unit_get_ancestor_memory_{low,min}() to work with units which don't have a CGroupContext Coverity doesn't like the fact that unit_get_cgroup_context() returns NULL for unit types that don't have a CGroupContext. We don't expect to call those functions with such unit types, so this isn't an immediate problem, but we can make things more robust by handling this case. CID #1400683, #1400684.	2019-04-25 11:13:02 +02:00
Zbigniew Jędrzejewski-Szmek	b6411f716c	Merge pull request #12332 from cdown/default_min cgroup: Add support for propagation of memory.min	2019-04-25 11:06:45 +02:00
Anita Zhang	25cc30c4c8	core: support DisableControllers= for transient units	2019-04-22 11:52:08 -07:00
Chris Down	7ad5439e06	unit: Add DefaultMemoryMin	2019-04-16 18:45:04 +01:00
Chris Down	6264b85e92	cgroup: Create UNIT_DEFINE_ANCESTOR_MEMORY_LOOKUP This is in preparation for creating unit_get_ancestor_memory_min.	2019-04-16 18:39:51 +01:00
Chris Down	c52db42b78	cgroup: Implement default propagation of MemoryLow with DefaultMemoryLow In cgroup v2 we have protection tunables -- currently MemoryLow and MemoryMin (there will be more in future for other resources, too). The design of these protection tunables requires not only intermediate cgroups to propagate protections, but also the units at the leaf of that resource's operation to accept it (by setting MemoryLow or MemoryMin). This makes sense from a low-level API design perspective, but it's a good idea to also have a higher-level abstraction that can, by default, propagate these resources to children recursively. In this patch, this happens by having descendants set memory.low to N if their ancestor has DefaultMemoryLow=N -- assuming they don't set a separate MemoryLow value. Any affected unit can opt out of this propagation by manually setting `MemoryLow` to some value in its unit configuration. A unit can also stop further propagation by setting `DefaultMemoryLow=` with no argument. This removes further propagation in the subtree, but has no effect on the unit itself (for that, use `MemoryLow=0`). Our use case in production is simplifying the configuration of machines which heavily rely on memory protection tunables, but currently require tweaking a huge number of unit files to make that a reality. This directive makes that significantly less fragile, and decreases the risk of misconfiguration. After this patch is merged, I will implement DefaultMemoryMin= using the same principles.	2019-04-12 17:23:58 +02:00
Lennart Poettering	fbe14fc9a7	croup: expose IO accounting data per unit This was the last kind of accounting still not exposed for each unit. Let's fix that. Note that this is a relatively simplistic approach: we don't expose per-device stats, but sum them all up, much like cgtop does. This kind of metric is probably the most interesting for most usecases, and covers the "systemctl status" output best. If we want per-device stats one day we can of course always add that eventually.	2019-04-12 14:25:44 +02:00
Lennart Poettering	9b2559a13e	core: add new call unit_reset_accounting() It's a simple wrapper for resetting both IP and CPU accounting in one go. This will become particularly useful when we also need this to reset IO accounting (to be added in a later commit).	2019-04-12 14:25:44 +02:00
Lennart Poettering	0bbff7d638	cgroup: get rid of a local variable	2019-04-12 14:25:44 +02:00
Lennart Poettering	afcfaa695c	core: implement OOMPolicy= and watch cgroups for OOM killings This adds a new per-service OOMPolicy= (along with a global DefaultOOMPolicy=) that controls what to do if a process of the service is killed by the kernel's OOM killer. It has three different values: "continue" (old behaviour), "stop" (terminate the service), "kill" (let the kernel kill all the service's processes). On top of that, track OOM killer events per unit: generate a per-unit structured, recognizable log message when we see an OOM killer event, and put the service in a failure state if an OOM killer event was seen and the selected policy was not "continue". A new "result" is defined for this case: "oom-kill". All of this relies on new cgroupv2 kernel functionality: the "memory.events" notification interface and the "memory.oom.group" attribute (which makes the kernel kill all cgroup processes automatically).	2019-04-09 11:17:58 +02:00
Lennart Poettering	0bb814c2c2	core: rename cgroup_inotify_wd → cgroup_control_inotify_wd Let's rename the .cgroup_inotify_wd field of the Unit object to .cgroup_control_inotify_wd. Let's similarly rename the hashmap .cgroup_inotify_wd_unit of the Manager object to .cgroup_control_inotify_wd_unit. Why? As preparation for a later commit that allows us to watch the "memory.events" cgroup attribute file in addition to the "cgroup.events" file we already watch with the fields above. In that later commit we'll add new fields "cgroup_memory_inotify_wd" to Unit and "cgroup_memory_inotify_wd_unit" to Manager, that are used to watch these other events file. No change in behaviour. Just some renaming.	2019-04-09 11:17:57 +02:00
Lennart Poettering	5210387ea6	core: check for redundant operation before doing allocation	2019-04-09 11:17:57 +02:00
Lennart Poettering	cbe83389d5	core: rearrange cgroup empty events a bit So far the priorities for cgroup empty event handling were pretty weird. The raw events (on cgroupsv2 from inotify, on cgroupsv1 from the agent dgram socket) were scheduled at a lower priority than the cgroup empty queue dispatcher. Let's swap that and ensure that we can coalesce events more aggressively: let's process the raw events at higher priority than the cgroup empty event (which remains at the same prio).	2019-04-09 11:17:57 +02:00
Franck Bui	f75f613d25	core: reduce the number of stalled PIDs from the watched processes list when possible Some PIDs can remain in the watched list even though their processes exited a long time ago. It can easily happen if the main process of a forking service manages to spawn a child before the control process exits for example. However when a pid is about to be mapped to a unit by calling unit_watch_pid(), the caller usually knows if the pid should belong to this unit exclusively: if we just forked() off a child, then we can be sure that its PID is otherwise unused. In this case we take this opportunity to remove any stalled PIDs from the watched process list. If we learnt about a PID in any other form (for example via PID file, via searching, MAINPID= and so on), then we can't assume anything.	2019-03-20 10:51:49 +01:00
Franck Bui	4d05154600	process-util: introduce pid_is_my_child() helper No functional changes.	2019-03-20 10:51:49 +01:00
Lennart Poettering	d8b4d14df4	util: split out nulstr related stuff to nulstr-util.[ch]	2019-03-14 13:25:52 +01:00
Filipe Brandenburger	527ede0c63	core: downgrade CPUQuotaPeriodSec= clamping logs to debug After the first warning log, further messages are downgraded to LOG_DEBUG.	2019-02-14 11:04:42 -08:00
Filipe Brandenburger	10f2864111	core: add CPUQuotaPeriodSec= This new setting allows configuration of CFS period on the CPU cgroup, instead of using a hardcoded default of 100ms. Tested: - Legacy cgroup + Unified cgroup - systemctl set-property - systemctl show - Confirmed that the cgroup settings (such as cpu.cfs_period_us) were set appropriately, including updating the CPU quota (cpu.cfs_quota_us) when CPUQuotaPeriodSec= is updated. - Checked that clamping works properly when either period or (quota * period) are below the resolution of 1ms, or if period is above the max of 1s.	2019-02-14 11:04:42 -08:00
Zbigniew Jędrzejewski-Szmek	c482724aa5	procfs-util: expose functionality to query total memory procfs_memory_get_current is renamed to procfs_memory_get_used, because "current" can mean anything, including total memory, used memory, and free memory, as long as the value is up to date. No functional change.	2019-01-22 17:43:13 +01:00
YunQiang Su	f5855697aa	Pass separate dev_t var to device_path_parse_major_minor MIPS/O32's st_rdev member of struct stat is unsigned long, which is 32bit, while dev_t is defined as 64bit, which causes problems in device_path_parse_major_minor. Don't pass st.st_rdev and st_mode to device_path_parse_major_minor; pass 2 separate variables instead. The result of stat is also copied out into these 2 variables. Fixes: #11247	2019-01-03 15:04:08 +01:00
Chris Down	4e1dfa45e9	cgroup: s/cgroups? ?v?([0-9])/cgroup v\1/gI Nitpicky, but we've used a lot of random spacings and names in the past, but we're trying to be completely consistent on "cgroup vN" now. Generated by `fd -0 | xargs -0 -n1 sed -ri --follow-symlinks 's/cgroups? ?v?([0-9])/cgroup v\1/gI'`. I manually ignored places where it's not appropriate to replace (eg. "cgroup2" fstype and in src/shared/linux).	2019-01-03 11:32:40 +09:00
Lennart Poettering	2d41e9b7a0	Merge pull request #11143 from keszybz/enable-symlink Runtime mask symlink confusion fix	2018-12-16 12:37:07 +01:00
Chris Down	cb5e3bc37d	cgroup: Don't explicitly check for member in UNIT_BEFORE The parent slice is always filtered ahead of time from UNIT_BEFORE, so checking if the current member is the same as the parent unit will never pass. I may also write a SLICE_FOREACH_CHILD macro to remove some more of the parent slice checks, but this requires a bit of a rework and general refactoring and may not be worth it, so let's just do this for now.	2018-12-12 20:50:10 +01:00
Zbigniew Jędrzejewski-Szmek	303ee60151	Mark data and userdata params to specifier_printf() as const It would be very wrong if any of the specifier printf calls modified any of the objects or data being printed. Let's mark all arguments as const (primarily to make it easier for the reader to see where modifications cannot occur).	2018-12-12 16:45:33 +01:00
Lennart Poettering	d742f4b54b	cgroup: correct mangling of return values Let's not return the unmangled return value before we actually mangle it. Fixes: #11062	2018-12-10 16:09:41 +01:00
Lennart Poettering	92a993041a	cgroup: call cg_all_unified() right before using the result Let's not query it before we actually need it.	2018-12-10 16:09:41 +01:00
Lennart Poettering	ea900d2bfe	Merge pull request #11009 from poettering/root-cgroup-again tweak root cgroup attribute fiddling for cgroupsv1 again	2018-12-04 12:33:03 +01:00
Chris Down	c72703e26d	cgroup: Add DisableControllers= directive to disable controller in subtree Some controllers (like the CPU controller) have a performance cost that is non-trivial on certain workloads. While this can be mitigated and improved to an extent, there will for some controllers always be some overheads associated with the benefits gained from the controller. Inside Facebook, the fix applied has been to disable the CPU controller forcibly with `cgroup_disable=cpu` on the kernel command line. This presents a problem: to disable or reenable the controller, a reboot is required, but this is quite cumbersome and slow to do for many thousands of machines, especially machines where disabling/enabling a stateful service on a machine is a matter of several minutes. Currently systemd provides some configuration knobs for these in the form of `[Default]CPUAccounting`, `[Default]MemoryAccounting`, and the like. The limitation of these is that Default*Accounting is overrideable by individual services, of which any one could decide to reenable a controller within the hierarchy at any point just by using a controller feature implicitly (eg. `CPUWeight`), even if the use of that CPU feature could just be opportunistic. Since many services are provided by the distribution, or by upstream teams at a particular organisation, it's not a sustainable solution to simply try to find and remove offending directives from these units. This commit presents a more direct solution -- a DisableControllers= directive that forcibly disallows a controller from being enabled within a subtree.	2018-12-03 15:40:31 +00:00
Chris Down	4f6f62e468	cgroup: Traverse leaves to realised cgroup to release controllers This adds a depth-first version of unit_realize_cgroup_now which can only do depth-first disabling of controllers, in preparation for the DisableControllers= directive.	2018-12-03 14:37:39 +00:00
Chris Down	a57669d290	cgroup: Rework unit_realize_cgroup_now to explicitly be breadth-first systemd currently doesn't really expend much effort in disabling controllers. unit_realize_cgroup_now may be able to disable a controller in the basic case when using cgroup v2, but generally won't manage as downstream dependents may still use it. This code doesn't add any logic to fix that, but it starts the process of moving to have a breadth-first version of unit_realize_cgroup_now for enabling, and a depth-first version of unit_realize_cgroup_now for disabling.	2018-12-03 14:37:39 +00:00
Chris Down	0d2d6fbf15	cgroup: Move attribute application into unit_create_cgroup We always end up doing these together, so just colocate them and require manager state for unit_create_cgroup.	2018-12-03 14:37:38 +00:00
Lennart Poettering	67e2ea1542	cgroup: suffix unit file settings with "=" in log output Let's follow our recommendations from CODING_STYLE and suffix unit file settings with "=" everywhere.	2018-12-01 12:57:51 +01:00
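
The run queue ordering described in `da8e178296` and refined in `217b7b33cc` can be sketched roughly as follows. This is a minimal illustration, not the code from those commits: `DemoJob`, `demo_job_compare()` and `demo_sort_jobs()` are hypothetical names, the local `CMP()` macro only mimics the three-way comparison helper the commit message mentions, and `qsort()` stands in for the Prioq. It assumes that jobs which do not execute processes sort first, and that CPU weight and nice level are compared only between jobs that do execute processes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Three-way comparison: -1, 0 or 1. Local stand-in for the CMP() helper
 * mentioned in the commit message. */
#define CMP(a, b) (((a) > (b)) - ((a) < (b)))

/* Hypothetical, simplified job record; not systemd's Job structure. */
typedef struct {
        const char *unit_name;
        int unit_type;          /* illustrative numeric unit type */
        int executes_process;   /* 1 for units that run processes, else 0 */
        unsigned cpu_weight;    /* higher weight should be started earlier */
        int nice;               /* lower nice level should be started earlier */
} DemoJob;

static int demo_job_compare(const void *pa, const void *pb) {
        const DemoJob *a = pa, *b = pb;
        int r;

        /* Jobs that do not execute processes are taken out of the queue first. */
        r = CMP(a->executes_process, b->executes_process);
        if (r != 0)
                return r;

        if (a->executes_process) {
                /* Compare left vs. right, then negate: higher weight sorts first. */
                r = CMP(a->cpu_weight, b->cpu_weight);
                if (r != 0)
                        return -r;

                /* Lower nice level sorts first. */
                r = CMP(a->nice, b->nice);
                if (r != 0)
                        return r;
        }

        r = CMP(a->unit_type, b->unit_type);
        if (r != 0)
                return r;

        /* The unit name is only a tie-breaker that keeps the order deterministic. */
        return strcmp(a->unit_name, b->unit_name);
}

/* Sort an array of jobs into the order in which they would be started. */
static void demo_sort_jobs(DemoJob *jobs, size_t n) {
        assert(jobs || n == 0);
        if (n > 0)
                qsort(jobs, n, sizeof(DemoJob), demo_job_compare);
}
```

With service A (weight 100), socket B and service C (weight 200) from the example in `217b7b33cc`, this comparison places B first, then C, then A.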

307 commits