Systemd

Author	SHA1	Message	Date
Zbigniew Jędrzejewski-Szmek	c5b7ae0edb	Merge pull request #12074 from poettering/io-acct expose IO stats on the bus and in "systemctl status" and "systemd-run --wait"	2019-04-25 11:59:37 +02:00
Zbigniew Jędrzejewski-Szmek	c5322608a5	core: adjust unit_get_ancestor_memory_{low,min}() to work with units which don't have a CGroupContext Coverity doesn't like the fact that unit_get_cgroup_context() returns NULL for unit types that don't have a CGroupContext. We don't expect to call those functions with such unit types, so this isn't an immediate problem, but we can make things more robust by handling this case. CID #1400683, #1400684.	2019-04-25 11:13:02 +02:00
Zbigniew Jędrzejewski-Szmek	b6411f716c	Merge pull request #12332 from cdown/default_min cgroup: Add support for propagation of memory.min	2019-04-25 11:06:45 +02:00
Anita Zhang	25cc30c4c8	core: support DisableControllers= for transient units	2019-04-22 11:52:08 -07:00
Chris Down	7ad5439e06	unit: Add DefaultMemoryMin	2019-04-16 18:45:04 +01:00
Chris Down	6264b85e92	cgroup: Create UNIT_DEFINE_ANCESTOR_MEMORY_LOOKUP This is in preparation for creating unit_get_ancestor_memory_min.	2019-04-16 18:39:51 +01:00
Chris Down	c52db42b78	cgroup: Implement default propagation of MemoryLow with DefaultMemoryLow In cgroup v2 we have protection tunables -- currently MemoryLow and MemoryMin (there will be more in future for other resources, too). The design of these protection tunables requires not only intermediate cgroups to propagate protections, but also the units at the leaf of that resource's operation to accept it (by setting MemoryLow or MemoryMin). This makes sense from a low-level API design perspective, but it's a good idea to also have a higher-level abstraction that can, by default, propagate these resources to children recursively. In this patch, this happens by having descendants set memory.low to N if their ancestor has DefaultMemoryLow=N -- assuming they don't set a separate MemoryLow value. Any affected unit can opt out of this propagation by manually setting `MemoryLow` to some value in its unit configuration. A unit can also stop further propagation by setting `DefaultMemoryLow=` with no argument. This removes further propagation in the subtree, but has no effect on the unit itself (for that, use `MemoryLow=0`). Our use case in production is simplifying the configuration of machines which heavily rely on memory protection tunables, but currently require tweaking a huge number of unit files to make that a reality. This directive makes that significantly less fragile, and decreases the risk of misconfiguration. After this patch is merged, I will implement DefaultMemoryMin= using the same principles.	2019-04-12 17:23:58 +02:00
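The propagation rule described in c52db42b78 can be sketched roughly as follows. This is a simplified model, not systemd's implementation: the `Unit` class, its field names, and `effective_memory_low()` are hypothetical, and an empty `DefaultMemoryLow=` assignment is modelled here as an explicit default of 0.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Unit:
    """Hypothetical stand-in for a unit in the slice hierarchy."""
    name: str
    parent: Optional["Unit"] = None
    # MemoryLow=; None means the unit does not set it.
    memory_low: Optional[int] = None
    # DefaultMemoryLow=; None means not set. An empty assignment
    # ("DefaultMemoryLow=") is modelled as 0 (an assumption of this sketch).
    default_memory_low: Optional[int] = None


def effective_memory_low(unit: Unit) -> int:
    """Return the memory.low value this model would apply to `unit`."""
    # An explicit MemoryLow= on the unit always wins (opt-out of propagation).
    if unit.memory_low is not None:
        return unit.memory_low
    # Otherwise take the nearest ancestor's DefaultMemoryLow=.
    # The unit's own DefaultMemoryLow= only affects its descendants.
    ancestor = unit.parent
    while ancestor is not None:
        if ancestor.default_memory_low is not None:
            return ancestor.default_memory_low
        ancestor = ancestor.parent
    return 0
```

In this model a slice that sets an empty `DefaultMemoryLow=` still inherits its own protection from above, while units below it fall back to 0.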
Lennart Poettering	fbe14fc9a7	cgroup: expose IO accounting data per unit This was the last kind of accounting still not exposed for each unit. Let's fix that. Note that this is a relatively simplistic approach: we don't expose per-device stats, but sum them all up, much like cgtop does. This kind of metric is probably the most interesting for most usecases, and covers the "systemctl status" output best. If we want per-device stats one day we can of course always add that eventually.	2019-04-12 14:25:44 +02:00
Lennart Poettering	9b2559a13e	core: add new call unit_reset_accounting() It's a simple wrapper for resetting both IP and CPU accounting in one go. This will become particularly useful when we also need this to reset IO accounting (to be added in a later commit).	2019-04-12 14:25:44 +02:00
Lennart Poettering	0bbff7d638	cgroup: get rid of a local variable	2019-04-12 14:25:44 +02:00
Lennart Poettering	afcfaa695c	core: implement OOMPolicy= and watch cgroups for OOM killings This adds a new per-service OOMPolicy= (along with a global DefaultOOMPolicy=) that controls what to do if a process of the service is killed by the kernel's OOM killer. It has three different values: "continue" (old behaviour), "stop" (terminate the service), "kill" (let the kernel kill all the service's processes). On top of that, track OOM killer events per unit: generate a per-unit structured, recognizable log message when we see an OOM killer event, and put the service in a failure state if an OOM killer event was seen and the selected policy was not "continue". A new "result" is defined for this case: "oom-kill". All of this relies on new cgroupv2 kernel functionality: the "memory.events" notification interface and the "memory.oom.group" attribute (which makes the kernel kill all cgroup processes automatically).	2019-04-09 11:17:58 +02:00
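As a rough illustration of the three OOMPolicy= values described in afcfaa695c, the sketch below maps a policy to the service result and the reaction. The function name and the returned action strings are hypothetical; only the policy names and the "oom-kill" result come from the commit message.

```python
def handle_oom_event(policy: str):
    """Return (service_result, action) for an OOM killer event.

    service_result is None when the service is not put into a failure state.
    """
    if policy == "continue":
        # Old behaviour: the event is logged, the service keeps running.
        return (None, "keep running")
    if policy == "stop":
        # The service manager terminates the service.
        return ("oom-kill", "terminate service")
    if policy == "kill":
        # The kernel kills all processes in the cgroup (memory.oom.group).
        return ("oom-kill", "kernel kills all cgroup processes")
    raise ValueError(f"unknown OOM policy: {policy!r}")
```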
Lennart Poettering	0bb814c2c2	core: rename cgroup_inotify_wd → cgroup_control_inotify_wd Let's rename the .cgroup_inotify_wd field of the Unit object to .cgroup_control_inotify_wd. Let's similarly rename the hashmap .cgroup_inotify_wd_unit of the Manager object to .cgroup_control_inotify_wd_unit. Why? As preparation for a later commit that allows us to watch the "memory.events" cgroup attribute file in addition to the "cgroup.events" file we already watch with the fields above. In that later commit we'll add new fields "cgroup_memory_inotify_wd" to Unit and "cgroup_memory_inotify_wd_unit" to Manager, that are used to watch these other events file. No change in behaviour. Just some renaming.	2019-04-09 11:17:57 +02:00
Lennart Poettering	5210387ea6	core: check for redundant operation before doing allocation	2019-04-09 11:17:57 +02:00
Lennart Poettering	cbe83389d5	core: rearrange cgroup empty events a bit So far the priorities for cgroup empty event handling were pretty weird. The raw events (on cgroupsv2 from inotify, on cgroupsv1 from the agent dgram socket) were scheduled at a lower priority than the cgroup empty queue dispatcher. Let's swap that and ensure that we can coalesce events more aggressively: let's process the raw events at higher priority than the cgroup empty event (which remains at the same prio).	2019-04-09 11:17:57 +02:00
Franck Bui	f75f613d25	core: reduce the number of stalled PIDs from the watched processes list when possible Some PIDs can remain in the watched list even though their processes have exited since a long time. It can easily happen if the main process of a forking service manages to spawn a child before the control process exits for example. However when a pid is about to be mapped to a unit by calling unit_watch_pid(), the caller usually knows if the pid should belong to this unit exclusively: if we just forked() off a child, then we can be sure that its PID is otherwise unused. In this case we take this opportunity to remove any stalled PIDs from the watched process list. If we learnt about a PID in any other form (for example via PID file, via searching, MAINPID= and so on), then we can't assume anything.	2019-03-20 10:51:49 +01:00
Franck Bui	4d05154600	process-util: introduce pid_is_my_child() helper No functional changes.	2019-03-20 10:51:49 +01:00
Lennart Poettering	d8b4d14df4	util: split out nulstr related stuff to nulstr-util.[ch]	2019-03-14 13:25:52 +01:00
Filipe Brandenburger	527ede0c63	core: downgrade CPUQuotaPeriodSec= clamping logs to debug After the first warning log, further messages are downgraded to LOG_DEBUG.	2019-02-14 11:04:42 -08:00
Filipe Brandenburger	10f2864111	core: add CPUQuotaPeriodSec= This new setting allows configuration of CFS period on the CPU cgroup, instead of using a hardcoded default of 100ms. Tested: - Legacy cgroup + Unified cgroup - systemctl set-property - systemctl show - Confirmed that the cgroup settings (such as cpu.cfs_period_ns) were set appropriately, including updating the CPU quota (cpu.cfs_quota_ns) when CPUQuotaPeriodSec= is updated. - Checked that clamping works properly when either period or (quota * period) are below the resolution of 1ms, or if period is above the max of 1s.	2019-02-14 11:04:42 -08:00
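The clamping checked in 10f2864111 can be sketched as below. This is an approximation under stated assumptions, not the code from the commit: the function name, the microsecond units, and the representation of the quota as microseconds of CPU time per second are choices made for this example. Only the 1ms resolution and the 1s maximum come from the commit message.

```python
USEC_PER_SEC = 1_000_000
RESOLUTION_USEC = 1_000        # 1ms, the resolution named in the commit
MAX_PERIOD_USEC = 1_000_000    # 1s, the maximum period named in the commit


def adjust_period(period_usec: int, quota_usec_per_sec: int) -> int:
    """Clamp a CFS period so that period and quota * period stay >= 1ms."""
    if quota_usec_per_sec <= 0:
        raise ValueError("quota must be positive")
    # Smallest period for which period * quota / USEC_PER_SEC >= resolution
    # (ceiling division keeps the bound on the safe side).
    min_for_quota = -(-RESOLUTION_USEC * USEC_PER_SEC // quota_usec_per_sec)
    lower_bound = max(RESOLUTION_USEC, min_for_quota)
    return min(max(period_usec, lower_bound), MAX_PERIOD_USEC)
```

For example, with a 10% quota (100000 usec per second) a requested 1ms period is raised to 10ms in this model, because only then does the resulting quota slice reach 1ms.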
Zbigniew Jędrzejewski-Szmek	c482724aa5	procfs-util: expose functionality to query total memory procfs_memory_get_current is renamed to procfs_memory_get_used, because "current" can mean anything, including total memory, used memory, and free memory, as long as the value is up to date. No functional change.	2019-01-22 17:43:13 +01:00
YunQiang Su	f5855697aa	Pass separate dev_t var to device_path_parse_major_minor MIPS/O32's st_rdev member of struct stat is unsigned long, which is 32bit, while dev_t is defined as 64bit, which causes problems in device_path_parse_major_minor. Don't pass st.st_rdev and st_mode to device_path_parse_major_minor; pass 2 separate variables instead. The result of stat is also copied out into these 2 variables. Fixes: #11247	2019-01-03 15:04:08 +01:00
Chris Down	4e1dfa45e9	cgroup: s/cgroups? ?v?([0-9])/cgroup v\1/gI Nitpicky, but we've used a lot of random spacings and names in the past, but we're trying to be completely consistent on "cgroup vN" now. Generated by `fd -0 | xargs -0 -n1 sed -ri --follow-symlinks 's/cgroups? ?v?([0-9])/cgroup v\1/gI'`. I manually ignored places where it's not appropriate to replace (eg. "cgroup2" fstype and in src/shared/linux).	2019-01-03 11:32:40 +09:00
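The sed expression from 4e1dfa45e9 translates directly to Python's `re` module, which makes its effect easy to try out. The helper name `normalize()` is made up for this example. The last case below shows why the "cgroup2" filesystem type had to be excluded by hand.

```python
import re

# Same pattern as the sed command; re.IGNORECASE corresponds to sed's "I" flag,
# and re.sub() replaces every match, like sed's "g" flag.
PATTERN = re.compile(r"cgroups? ?v?([0-9])", re.IGNORECASE)


def normalize(text: str) -> str:
    """Rewrite all spellings of 'cgroups vN' to the canonical 'cgroup vN'."""
    return PATTERN.sub(r"cgroup v\1", text)


print(normalize("cgroupsv1 and cgroups v2"))
print(normalize("mount -t cgroup2"))  # the fstype is rewritten too
```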
Lennart Poettering	2d41e9b7a0	Merge pull request #11143 from keszybz/enable-symlink Runtime mask symlink confusion fix	2018-12-16 12:37:07 +01:00
Chris Down	cb5e3bc37d	cgroup: Don't explicitly check for member in UNIT_BEFORE The parent slice is always filtered ahead of time from UNIT_BEFORE, so checking if the current member is the same as the parent unit will never pass. I may also write a SLICE_FOREACH_CHILD macro to remove some more of the parent slice checks, but this requires a bit of a rework and general refactoring and may not be worth it, so let's just do this for now.	2018-12-12 20:50:10 +01:00
Zbigniew Jędrzejewski-Szmek	303ee60151	Mark data and userdata params to specifier_printf() as const It would be very wrong if any of the specifier printf calls modified any of the objects or data being printed. Let's mark all arguments as const (primarily to make it easier for the reader to see where modifications cannot occur).	2018-12-12 16:45:33 +01:00
Lennart Poettering	d742f4b54b	cgroup: correct mangling of return values Let's not return the unmangled return value before we actually mangle it. Fixes: #11062	2018-12-10 16:09:41 +01:00
Lennart Poettering	92a993041a	cgroup: call cg_all_unified() right before using the result Let's not query it before we actually need it.	2018-12-10 16:09:41 +01:00
Lennart Poettering	ea900d2bfe	Merge pull request #11009 from poettering/root-cgroup-again tweak root cgroup attribute fiddling for cgroupsv1 again	2018-12-04 12:33:03 +01:00
Chris Down	c72703e26d	cgroup: Add DisableControllers= directive to disable controller in subtree Some controllers (like the CPU controller) have a performance cost that is non-trivial on certain workloads. While this can be mitigated and improved to an extent, there will for some controllers always be some overheads associated with the benefits gained from the controller. Inside Facebook, the fix applied has been to disable the CPU controller forcibly with `cgroup_disable=cpu` on the kernel command line. This presents a problem: to disable or reenable the controller, a reboot is required, but this is quite cumbersome and slow to do for many thousands of machines, especially machines where disabling/enabling a stateful service on a machine is a matter of several minutes. Currently systemd provides some configuration knobs for these in the form of `[Default]CPUAccounting`, `[Default]MemoryAccounting`, and the like. The limitation of these is that Default*Accounting is overrideable by individual services, of which any one could decide to reenable a controller within the hierarchy at any point just by using a controller feature implicitly (eg. `CPUWeight`), even if the use of that CPU feature could just be opportunistic. Since many services are provided by the distribution, or by upstream teams at a particular organisation, it's not a sustainable solution to simply try to find and remove offending directives from these units. This commit presents a more direct solution -- a DisableControllers= directive that forcibly disallows a controller from being enabled within a subtree.	2018-12-03 15:40:31 +00:00
Chris Down	4f6f62e468	cgroup: Traverse leaves to realised cgroup to release controllers This adds a depth-first version of unit_realize_cgroup_now which can only do depth-first disabling of controllers, in preparation for the DisableControllers= directive.	2018-12-03 14:37:39 +00:00
Chris Down	a57669d290	cgroup: Rework unit_realize_cgroup_now to explicitly be breadth-first systemd currently doesn't really expend much effort in disabling controllers. unit_realize_cgroup_now may be able to disable a controller in the basic case when using cgroup v2, but generally won't manage as downstream dependents may still use it. This code doesn't add any logic to fix that, but it starts the process of moving to have a breadth-first version of unit_realize_cgroup_now for enabling, and a depth-first version of unit_realize_cgroup_now for disabling.	2018-12-03 14:37:39 +00:00
Chris Down	0d2d6fbf15	cgroup: Move attribute application into unit_create_cgroup We always end up doing these together, so just colocate them and require manager state for unit_create_cgroup.	2018-12-03 14:37:38 +00:00
Lennart Poettering	67e2ea1542	cgroup: suffix unit file settings with "=" in log output Let's follow our recommendations from CODING_STYLE and suffix unit file settings with "=" everywhere.	2018-12-01 12:57:51 +01:00
Lennart Poettering	be2c032781	core: don't try to write CPU quota and memory limit cgroup attrs on root cgroup In the kernel sources attempts to write to either are refused with EINVAL. Not sure why these attributes are exported anyway on cgroupsv1, but this means we really should ignore them altogether. This simplifies our code as this means cgroupsv1 is more alike cgroupsv2 in this regard. Fixes: #10969	2018-12-01 12:57:51 +01:00
Lennart Poettering	d5aecba6e0	cgroup: use device_path_parse_major_minor() also for block device paths Not only when we populate the "devices" cgroup controller we need major/minor numbers, but for the io/blkio one it's the same, hence let's use the same logic for both.	2018-11-29 20:21:39 +01:00
Lennart Poettering	846b3bd61e	stat-util: add new APIs device_path_make_{major_minor|canonical}() and device_path_parse_major_minor() device_path_make_{major_minor|canonical}() generate device node paths given a mode_t and a dev_t. We have similar code all over the place, let's unify this in one place. The former will generate a "/dev/char/" or "/dev/block" path, and never go to disk. The latter then goes to disk and resolves that path to the actual path of the device node. device_path_parse_major_minor() reverses device_path_make_major_minor(), also without going to disk. We have similar code doing something like this at various places, let's unify this in a single set of functions. This also allows us to teach them special tricks, for example handling of the /run/systemd/inaccessible/{blk|chr} device nodes, which we use for masking device nodes, and which do not exist in /dev/char/* and /dev/block/*	2018-11-29 20:21:39 +01:00
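A minimal sketch of the make/parse pair described in 846b3bd61e, neither of which touches the disk. It is an illustration only: the Python signatures take major and minor as separate integers rather than a dev_t, the `/dev/{block,char}/MAJOR:MINOR` layout is the usual udev convention, and the special handling of /run/systemd/inaccessible/ is not modelled.

```python
import re
import stat

_DEV_PATH = re.compile(r"/dev/(block|char)/([0-9]+):([0-9]+)")


def device_path_make_major_minor(mode: int, major: int, minor: int) -> str:
    """Build a /dev/block/ or /dev/char/ path purely from numbers."""
    if stat.S_ISCHR(mode):
        kind = "char"
    elif stat.S_ISBLK(mode):
        kind = "block"
    else:
        raise ValueError("mode is neither a block nor a character device")
    return f"/dev/{kind}/{major}:{minor}"


def device_path_parse_major_minor(path: str):
    """Reverse of the function above: return (mode, major, minor)."""
    m = _DEV_PATH.fullmatch(path)
    if m is None:
        raise ValueError(f"not a major/minor device path: {path!r}")
    mode = stat.S_IFBLK if m.group(1) == "block" else stat.S_IFCHR
    return mode, int(m.group(2)), int(m.group(3))
```

Because both directions are pure string operations, a caller can describe a device that does not exist yet.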
Lennart Poettering	8e8b5d2e6d	cgroups: beef up DeviceAllow= syntax a bit Previously we'd allow pattern expressions such as "char-input" to match all input devices. Internally, this would look up the right major to test in /proc/devices. With this commit the syntax is slightly extended: - "char-" can be used to match any kind of character device, and similar "block-. This expression would work previously already, but instead of actually installing a wildcard match it would install many individual matches for everything listed in /proc/devices. - "char-<MAJOR>" with "<MAJOR>" being a numerical parameter works now too. This allows clients to install whitelist items by specifying the major directly. The main reason to add these is to provide limited compat support for clients that for some reason contain whitelists with major/minor numbers (such as OCI containers).	2018-11-29 20:21:39 +01:00
Lennart Poettering	74c48bf5a8	core: add special handling for devices cgroup allow lists for /dev/block/* and /dev/char/* device nodes This adds some code to handle /dev/block/* and /dev/char/* device node paths specially: instead of actually stat()ing them we'll just parse the major/minor name from the name. This is a useful 'hack' to allow clients to install whitelists for devices that don't actually have to exist. Also, let's similarly handle /run/systemd/inaccessible/{blk|chr}. This allows us to simplify our built-in default whitelist to not require a "ignore_enoent" mode for these nodes. In general we should be careful with hardcoding major/minor numbers, but in this case this should be safe.	2018-11-29 20:03:56 +01:00
Lennart Poettering	5af8805872	cgroup: drastically simplify caching of cgroups members mask Previously we tried to be smart: when a new unit appeared and it only added controllers to the cgroup mask we'd update the cached members mask in all parents by ORing in the controller flags in their cached values. Unfortunately this was quite broken, as we missed some conditions when this cache had to be reset (for example, when a unit got unloaded), moreover the optimization doesn't work when a controller is removed anyway (as in that case there's no other way for the parent to iterate through all children if any other, remaining child unit still needs it). Hence, let's simplify the logic substantially: instead of updating the cache on the right events (which we didn't get right), let's simply invalidate the cache, and generate it lazily when we encounter it later. This should actually result in better behaviour as we don't have to calculate the new members mask for a whole subtree whenever we have the suspicion something changed, but can delay it to the point where we actually need the members mask. This allows us to simplify things quite a bit, which is good, since validating this cache for correctness is hard enough. Fixes: #9512	2018-11-23 13:41:37 +01:00
Lennart Poettering	8a0d538815	cgroup: extend comment on what unit_release_cgroup() is for	2018-11-23 13:41:37 +01:00
Lennart Poettering	1fd3a10c38	cgroup: extend reasons when we realize the enable mask After creating a cgroup we need to initialize its "cgroup.subtree_control" file with the controllers its children want to use. Currently we do so whenever the mkdir() on the cgroup succeeded, i.e. when we know the cgroup is "fresh". Let's update the condition slightly that we also do so when internally we assume a cgroup doesn't exist yet, even if it already does (maybe left-over from a previous run). This shouldn't change anything IRL but make things a bit more robust.	2018-11-23 13:41:37 +01:00
Lennart Poettering	d5095dcd30	cgroup: tighten call that detects whether we need to realize a unit's cgroup a bit, and comment why	2018-11-23 13:41:37 +01:00
Lennart Poettering	27c4ed790a	cgroup: simplify check whether it makes sense to realize a cgroup	2018-11-23 13:41:37 +01:00
Lennart Poettering	e00068e71f	cgroup: in unit_invalidate_cgroup() actually modify invalidation mask Previously this would manipulate the realization mask for invalidating the realization. This is a bit ugly though as the realization mask's primary purpose is to reflect in which hierarchies a cgroup currently exists, and it's probably a good idea to keep that in sync with realities. We nowadays have an explicit field for invalidating cgroup controller information, the "cgroup_invalidated_mask", let's use this one instead. The effect is pretty much the same, as the main consumer of these masks (unit_has_mask_realize()) checks both anyway.	2018-11-23 13:41:37 +01:00
Lennart Poettering	27adcc9737	cgroup: be more careful with which controllers we can enable/disable on a cgroup This changes cg_enable_everywhere() to return which controllers are enabled for the specified cgroup. This information is then used to correctly track the enablement mask currently in effect for a unit. Moreover, when we try to turn off a controller, and this works, then this indicates that the parent unit might successfully turn it off now, too, as our unit might have kept it busy. So far, when realizing cgroups, i.e. when syncing up the kernel representation of relevant cgroups with our own idea we would strictly work from the root to the leaves. This is generally a good approach, as when controllers are enabled this has to happen in root-to-leaves order. However, when controllers are disabled this has to happen in the opposite order: in leaves-to-root order (this is because controllers can only be enabled in a child if it is already enabled in the parent, and if it shall be disabled in the parent then it has to be disabled in the child first, otherwise it is considered busy when it is attempted to remove it in the parent). To make things complicated when invalidating a unit's cgroup membership systemd can actually turn off some controllers previously turned on at the very same time as it turns on other controllers previously turned off. In such a case we have to work up leaves-to-root and root-to-leaves right after each other. With this patch this is implemented: we still generally operate root-to-leaves, but as soon as we noticed we successfully turned off a controller previously turned on for a cgroup we'll re-enqueue the cgroup realization for all parents of a unit, thus implementing leaves-to-root where necessary.	2018-11-23 13:41:37 +01:00
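The ordering constraint explained in 27adcc9737 can be demonstrated with a toy model. Everything here is hypothetical and heavily simplified (a real cgroup v2 hierarchy works through cgroup.subtree_control files): the `Cgroup` class only enforces the two rules from the commit message, namely that a controller can be enabled only if the parent has it, and disabled only once no child has it.

```python
class ControllerError(Exception):
    """Raised when the toy hierarchy refuses an enable/disable request."""


class Cgroup:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.enabled = set()
        if parent is not None:
            parent.children.append(self)

    def enable(self, controller):
        # A child may only enable what its parent already has enabled.
        if self.parent is not None and controller not in self.parent.enabled:
            raise ControllerError(f"{self.name}: parent lacks {controller}")
        self.enabled.add(controller)

    def disable(self, controller):
        # A parent is "busy" while any child still has the controller.
        if any(controller in c.enabled for c in self.children):
            raise ControllerError(f"{self.name}: {controller} is busy")
        self.enabled.discard(controller)


def chain_root_to_leaf(cgroup):
    """Return the path from the root down to `cgroup`."""
    chain = []
    while cgroup is not None:
        chain.append(cgroup)
        cgroup = cgroup.parent
    return chain[::-1]


def enable_path(leaf, controller):
    for cg in chain_root_to_leaf(leaf):            # root-to-leaves
        cg.enable(controller)


def disable_path(leaf, controller):
    for cg in reversed(chain_root_to_leaf(leaf)):  # leaves-to-root
        cg.disable(controller)
```

Running the two loops in the opposite direction fails in this model, which is the reason the commit re-enqueues the parents after a successful disable.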
Lennart Poettering	26a17ca280	cgroup: add explanatory comment	2018-11-23 12:24:37 +01:00
Lennart Poettering	442ce7759c	cgroup: units that aren't loaded properly should not result in cgroup controllers being pulled in This shouldn't make much difference in real life, but is a bit cleaner.	2018-11-23 12:24:37 +01:00
Lennart Poettering	1649244588	cgroup: make unit_get_needs_bpf_firewall() static too	2018-11-23 12:24:37 +01:00
Lennart Poettering	53aea74a60	cgroup: make some functions static	2018-11-23 12:24:37 +01:00
Lennart Poettering	52fecf20b9	cgroup: fine tune when to apply cgroup attributes to the root cgroup Let's tweak when precisely to apply cgroup attributes on the root cgroup. With this we now follow the following rules: 1. On cgroupsv2 we never apply any regular cgroups to the host root, since the attributes generally do not exist there. 2. On cgroupsv1 we do not apply any "weight" or "shares" style attributes to the host root cgroup, since they don't make much sense on the top level where there's only one group, hence no need to compare weights against each other. The other attributes are applied to the host root cgroup however. 3. In any case we don't apply attributes to the root of container environments (and --user roots), under the assumption that this is managed by the manager further up. (Note that on cgroupsv2 this is even enforced by the kernel) 4. BPF pseudo-attributes are applied in all cases (since we can have as many of them as we want)	2018-11-23 12:24:37 +01:00
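The four rules listed in 52fecf20b9 can be condensed into a single decision function. This is a reading of the commit message, not the actual code: the function, its parameters, and the three attribute kinds ("bpf", "weight", "other") are invented for this sketch, and cgroups that are not a root of any kind are assumed to always get their attributes.

```python
def should_apply(attr_kind: str, unified: bool,
                 is_host_root: bool, is_delegated_root: bool) -> bool:
    """Decide whether an attribute is written to the given cgroup.

    attr_kind: "bpf" (pseudo-attribute), "weight" (weight/shares style)
               or "other" (any other regular attribute).
    is_delegated_root: root of a container or of a --user manager.
    """
    if attr_kind == "bpf":
        return True                   # rule 4: always applied
    if is_delegated_root:
        return False                  # rule 3: managed further up
    if is_host_root:
        if unified:
            return False              # rule 1: nothing on the v2 host root
        return attr_kind != "weight"  # rule 2: no weights on the v1 host root
    return True                       # assumption: non-root cgroups get everything
```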
Lennart Poettering	589a5f7a38	cgroup: append \n to static strings we write to cgroup attributes This is a bit cleaner since when we format numeric limits we append it. And this way write_string_file() doesn't have to append it.	2018-11-23 12:24:37 +01:00
Lennart Poettering	28cfdc5aeb	cgroup: tighten manager_owns_host_root_cgroup() a bit This tightening is not strictly necessary (as the m->cgroup_root check further down does the same), but let's make this explicit.	2018-11-23 12:24:37 +01:00
Lennart Poettering	611c4f8afb	cgroup: rename {manager_owns|unit_has}_root_cgroup() → .._host_root_cgroup() Let's emphasize that this function checks for the host root cgroup, i.e. returns false for the root cgroup when we run in a container where CLONE_NEWCGROUP is used. There has been some confusion around this already, for example cgroup_context_apply() uses the function incorrectly (which we'll fix in a later commit). Just some refactoring, no change in behaviour.	2018-11-23 12:24:37 +01:00
Lennart Poettering	293d32df39	cgroup: add a common routine for writing to attributes, and logging about it We can use this at quite a few places, and this allows us to shorten our code quite a bit.	2018-11-23 12:24:37 +01:00
Lennart Poettering	39b9fefb2e	cgroup: add a new macro for determining log level for cgroup attr write failures For now, let's use it only at one place, but a follow-up commit will make more use of it.	2018-11-23 12:24:37 +01:00
Lennart Poettering	2c74e12bb3	cgroup: ignore EPERM for a couple of more attribute writes	2018-11-23 12:24:37 +01:00
Lennart Poettering	8c83840772	cgroup: add comment explaining why we ignore EINVAL at two places These are just copies from further down.	2018-11-23 12:24:37 +01:00
Lennart Poettering	73fe5314bf	cgroup: suffix settings with "=" in log messages where appropriate	2018-11-23 12:24:37 +01:00
Lennart Poettering	a0c339ed4b	cgroup: only install cgroup release agent when we own the root cgroup If we run in a container we shouldn't patch around this, and most likely we can't anyway, and there's not much point in complaining about this. Hence let's strictly say: the agent is private property of the host's system instance, nothing else.	2018-11-23 12:24:37 +01:00
Lennart Poettering	de8a711a58	cgroup: use structured initialization	2018-11-23 12:24:37 +01:00
Chris Down	f98c25850f	cgroup v2: Don't require CPU controller for CPU accounting in 4.15+ systemd only uses functions that are as of Linux 4.15+ provided externally to the CPU controller (currently usage_usec), so if we have a new enough kernel, we don't need to set CGROUP_MASK_CPU for CPUAccounting=true as the CPU controller does not need to necessarily be enabled in this case. Part of this patch is modelled on an earlier patch by Ryutaroh Matsumoto (see PR #9665).	2018-11-18 12:21:41 +00:00
Lennart Poettering	fae9bc298a	cgroup: when determining which controllers we need, always extend the mask according to cpu/cpuacct joint mounting Note that for cgroup_context_get_mask() this doesn't actually change much, but it does prepare the ground for #10507 later on.	2018-11-16 14:54:13 +01:00
Lennart Poettering	8d33dca2ff	core: fix capitalization of CPUShares= settings	2018-11-16 14:46:49 +01:00
Lennart Poettering	c2baf11c36	cgroup: actually reset the cgroup invalidation mask after we made our changes Previously we never unmasked the mask after it was set once. Let's fix that.	2018-11-08 15:20:52 +01:00
Yu Watanabe	5e1ee764e1	core: include error cause in log message	2018-10-20 01:40:42 +09:00
Lennart Poettering	490c5a37cb	tree-wide: some automatic coccinelle fixes (#10463 ) Nothing fancy, just coccinelle doing its work.	2018-10-20 00:07:46 +09:00
Lennart Poettering	c66e60a838	cgroup: FOREACH_LINE exorcism	2018-10-18 16:23:45 +02:00
Lennart Poettering	913c898ca0	cgroup: voidify a few things	2018-10-13 12:37:13 +02:00
Lennart Poettering	b9839ac9d9	cgroup: make sure whitelist_device() always returns a valid return value CID 1396094	2018-10-13 12:37:13 +02:00
Zbigniew Jędrzejewski-Szmek	f436470ae1	Merge pull request #10343 from poettering/manager-state-fix various fixes for PID1's Manager object	2018-10-10 12:36:16 +02:00
Lennart Poettering	638cece45d	core: clean up test run flags Let's make them typesafe, and let's add a nice macro helper for checking if we are in a test run, which should make testing for this much easier to read for most cases.	2018-10-09 19:43:43 +02:00
Roman Gushchin	084c700780	core: support cgroup v2 device controller Cgroup v2 provides the eBPF-based device controller, which isn't currently supported by systemd. This commit aims to provide such support. There are no user-visible changes, just the device policy and whitelist start working if cgroup v2 is used.	2018-10-09 09:47:51 -07:00
Roman Gushchin	17f149556a	core: refactor bpf firewall support into a pseudo-controller The idea is to introduce a concept of bpf-based pseudo-controllers to make adding new bpf-based features easier.	2018-10-09 09:46:08 -07:00
Tejun Heo	6ae4283cb1	core: add IODeviceLatencyTargetSec This adds support for the following proposed latency based IO control mechanism. https://lkml.org/lkml/2018/6/5/428	2018-08-22 16:46:18 +02:00
Yu Watanabe	fd870bac25	core: introduce cgroup_add_device_allow()	2018-08-06 13:42:14 +09:00
Tejun Heo	4842263577	core: add MemoryMin The kernel added support for a new cgroup memory controller knob memory.min in bf8d5d52ffe8 ("memcg: introduce memory.min") which was merged during v4.18 merge window. Add MemoryMin to support memory.min.	2018-07-12 08:21:43 +02:00
Yu Watanabe	b4dec49f83	core/cgroup: drop unnecessary condition	2018-06-25 13:09:48 +09:00
Lennart Poettering	0c69794138	tree-wide: remove Lennart's copyright lines These lines are generally out-of-date, incomplete and unnecessary. With SPDX and git repository much more accurate and fine grained information about licensing and authorship is available, hence let's drop the per-file copyright notice. Of course, removing copyright lines of others is problematic, hence this commit only removes my own lines and leaves all others untouched. It might be nicer if sooner or later those could go away too, making git the only and accurate source of authorship information.	2018-06-14 10:20:20 +02:00
Lennart Poettering	818bf54632	tree-wide: drop 'This file is part of systemd' blurb This part of the copyright blurb stems from the GPL use recommendations: https://www.gnu.org/licenses/gpl-howto.en.html The concept appears to originate in times where version control was per file, instead of per tree, and was a way to glue the files together. Ultimately, we nowadays don't live in that world anymore, and this information is entirely useless anyway, as people are very welcome to copy these files into any projects they like, and they shouldn't have to change bits that are part of our copyright header for that. hence, let's just get rid of this old cruft, and shorten our codebase a bit.	2018-06-14 10:20:20 +02:00
Lennart Poettering	17ae278097	core: when applying io/blkio per-device rules, don't remove them if they fail These devices might show up later, hence leave the rules as they are. Applying the limits should not alter configuration.	2018-06-12 22:52:36 +02:00
Zbigniew Jędrzejewski-Szmek	24d169e092	Merge pull request #9255 from poettering/block-dev-fixes some block device handling fixes	2018-06-12 12:53:37 +02:00
Zbigniew Jędrzejewski-Szmek	65be7e0652	pid1: do not reset subtree_control on already-existing units with delegation Fixes #8364. Reproducer: $ sudo systemd-run -t -p Delegate=yes bash # mkdir /sys/fs/cgroup/system.slice/run-u6958.service/supervisor # echo $$ > /sys/fs/cgroup/system.slice/run-u6958.service/supervisor/cgroup.procs # echo +memory > /sys/fs/cgroup/system.slice/run-u6958.service/cgroup.subtree_control # cat /sys/fs/cgroup/system.slice/run-u6958.service/cgroup.subtree_control memory # systemctl daemon-reload # cat /sys/fs/cgroup/system.slice/run-u6958.service/cgroup.subtree_control (empty) With patch, the last command shows 'memory'.	2018-06-11 18:12:30 +02:00
Lennart Poettering	45c2e06854	cgroup: beef up device lookup logic for block devices Let's chase block devices through btrfs and LUKS like we do elsewhere.	2018-06-11 18:01:06 +02:00
Lennart Poettering	19a691a9fd	cgroup: tiny log message tweak, say that we ignore one kind of failure	2018-06-05 22:04:39 +02:00
Zbigniew Jędrzejewski-Szmek	d94a24ca2e	Add macro for checking if some flags are set This way we don't need to repeat the argument twice. I didn't replace all instances. I think it's better to leave out: - asserts - comparisons like x & y == x, which are mathematically equivalent, but here we aren't checking if flags are set, but if the argument fits in the flags.	2018-06-04 11:50:44 +02:00
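The check that the macro from d94a24ca2e wraps is "are all of these flags set in the value". A one-line Python equivalent is shown below; the name `flags_set()` is chosen for this example and the macro itself is not reproduced here.

```python
def flags_set(value: int, flags: int) -> bool:
    """Return True if every bit in `flags` is also set in `value`."""
    # The flags argument appears twice in the expression; a helper means
    # the caller only has to write it once.
    return (value & flags) == flags
```

The parentheses are optional in Python, where `&` binds tighter than `==`, but they are required in C, where `x & y == x` parses as `x & (y == x)`.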
Yu Watanabe	858d36c1ec	path-util: introduce path_simplify() The function is similar to path_kill_slashes() but also removes initial './', trailing '/.', and '/./' in the path. When the second argument of path_simplify() is false, then it behaves as the same as path_kill_slashes(). Hence, this also replaces path_kill_slashes() with path_simplify().	2018-06-03 23:39:26 +09:00
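A small sketch of the behaviour described in 858d36c1ec: with the second argument false only duplicate and trailing slashes are removed, with it true the "." components go as well. The parameter name `kill_dots` and the handling of edge cases such as an empty result are assumptions of this example, not taken from the source.

```python
def path_simplify(path: str, kill_dots: bool) -> str:
    """Collapse redundant slashes and, optionally, '.' components."""
    absolute = path.startswith("/")
    # Splitting on "/" and dropping empty parts removes duplicate
    # and trailing slashes.
    parts = [p for p in path.split("/") if p]
    if kill_dots:
        # Drops an initial "./", a trailing "/." and any "/./" in between.
        parts = [p for p in parts if p != "."]
    joined = "/".join(parts)
    return "/" + joined if absolute else joined


print(path_simplify("//foo///bar/", False))
print(path_simplify("./foo/./bar/.", True))
```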
Zbigniew Jędrzejewski-Szmek	b1c05b98bf	tree-wide: avoid assignment of r just to use in a comparison This changes r = ...; if (r < 0) to if (... < 0) when r will not be used again.	2018-04-24 14:10:27 +02:00
Lennart Poettering	5d13a15b1d	tree-wide: drop spurious newlines (#8764) Double newlines (i.e. one empty line) are great to structure code. But let's avoid triple newlines (i.e. two empty lines), quadruple newlines, quintuple newlines, …, that's just spurious whitespace. It's an easy way to drop 121 lines of code, and keeps the coding style of our sources a bit tighter.	2018-04-19 12:13:23 +02:00
Lennart Poettering	57ea45e11a	util-lib: introduce new empty_or_root() helper (#8746) We check the same condition at various places. Let's add a trivial, common helper for this, and use it everywhere. It's not going to make things much faster or much shorter, but I think a lot more readable	2018-04-18 14:20:49 +02:00
Zbigniew Jędrzejewski-Szmek	11a1589223	tree-wide: drop license boilerplate Files which are installed as-is (any .service and other unit files, .conf files, .policy files, etc), are left as is. My assumption is that SPDX identifiers are not yet that well known, so it's better to retain the extended header to avoid any doubt. I also kept any copyright lines. We can probably remove them, but it'd be nice to obtain explicit acks from all involved authors before doing that.	2018-04-06 18:58:55 +02:00
Evgeny Vereshchagin	f6c63f6fc9	core: skip the removal of cgroups in the TEST_RUN_MINIMAL mode (#8622) When `systemd` is run in the TEST_RUN_MINIMAL mode, it doesn't really set up cgroups, so it shouldn't try to remove anything. Closes https://github.com/systemd/systemd/issues/8474.	2018-04-03 15:04:22 +02:00
Lennart Poettering	ae2a15bc14	macro: introduce TAKE_PTR() macro This macro will read a pointer of any type, return it, and set the pointer to NULL. This is useful as an explicit concept of passing ownership of a memory area between pointers. This takes inspiration from Rust: https://doc.rust-lang.org/std/option/enum.Option.html#method.take and was suggested by Alan Jenkins (@sourcejedi). It drops ~160 lines of code from our codebase, which makes me like it. Also, I think it clarifies passing of ownership, and thus helps readability a bit (at least for the initiated who know the new macro)	2018-03-22 20:21:42 +01:00
Michal Sekletar	aa77e234fc	core: ignore errors from cg_create_and_attach() in test mode (#8401 ) Reproducer: $ meson build && cd build $ ninja $ sudo useradd test $ sudo su test $ ./systemd --system --test ... Failed to create /user.slice/user-1000.slice/session-6.scope/init.scope control group: Permission denied Failed to allocate manager object: Permission denied Above error message is caused by the fact that user test didn't have its own session and we tried to set up init.scope already running as user test in the directory owned by different user. Let's try to setup cgroup hierarchy, but if that fails return error only when not running in the test mode. Fixes #8072	2018-03-09 23:30:32 +01:00
Lennart Poettering	902c8502ad	Merge pull request #8149 from poettering/fake-root-cgroup Properly synthesize CPU+memory accounting data for the root cgroup	2018-03-01 11:10:24 +01:00
Lennart Poettering	acf7f253de	bpf: use BPF_F_ALLOW_MULTI flag if it is available This new kernel 4.15 flag permits that multiple BPF programs can be executed for each packet processed: multiple per cgroup plus all programs defined up the tree on all parent cgroups. We can use this for two features: 1. Finally provide per-slice IP accounting (which was previously unavailable) 2. Permit delegation of BPF programs to services (i.e. leaf nodes). This patch beefs up PID1's handling of BPF to enable both. Note two special items to keep in mind: a. Our inner-node BPF programs (i.e. the ones we attach to slices) do not enforce IP access lists, that's done exclusively in the leaf-node BPF programs. That's a good thing, since that way rules in leaf nodes can cancel out rules further up (i.e. for example to implement a logic of "disallow everything except httpd.service"). Inner node BPF programs do accounting however if that's requested. This is beneficial for performance reasons: it means in order to provide per-slice IP accounting we don't have to add up all child unit's data. b. When this code is run on pre-4.15 kernel (i.e. where BPF_F_ALLOW_MULTI is not available) we'll make IP accounting on slice units unavailable (i.e. revert to behaviour from before this commit). For leaf nodes we'll fallback to non-ALLOW_MULTI mode however, which means that BPF delegation is not available there at all, if IP fw/acct is turned on for the unit. This is a change from earlier behaviour, where we use the BPF_F_ALLOW_OVERRIDE flag, so that our fw/acct would lose its effect as soon as delegation was turned on and some client made use of that. I think the new behaviour is the safer choice in this case, as silent bypassing of our fw rules is not possible anymore. And if people want proper delegation then the way out is a more modern kernel or turning off IP firewalling/acct for the unit altogether.	2018-02-21 16:43:36 +01:00
Lennart Poettering	6592b9759c	core: add new bus call for migrating foreign processes to scope/service units This adds a new bus call to service and scope units called AttachProcesses() that moves arbitrary processes into the cgroup of the unit. The primary user for this new API is systemd itself: the systemd --user instance uses this call of the systemd --system instance to migrate processes if itself gets the request to migrate processes and the kernel refuses this due to access restrictions. The primary use-case of this is to make "systemd-run --scope --user …" invoked from user session scopes work correctly on pure cgroupsv2 environments. There, the kernel refuses to migrate processes between two unprivileged-owned cgroups unless the requestor as well as the ownership of the closest parent cgroup all match. This however is not the case between the session-XYZ.scope unit of a login session and the user@ABC.service of the systemd --user instance. The new logic always tries to move the processes on its own, but if that doesn't work when being the user manager, then the system manager is asked to do it instead. The new operation is relatively restrictive: it will only allow to move the processes like this if the caller is root, or the UID of the target unit, caller and process all match. Note that this means that unprivileged users cannot attach processes to scope units, as those do not have "owning" users (i.e. they have no User= field). Fixes: #3388	2018-02-12 11:34:00 +01:00
Lennart Poettering	1d9cc8768f	cgroup: add a new "can_delegate" flag to the unit vtable, and set it for scope and service units only Currently we allowed delegation for all units with cgroup backing except for slices. Let's make this a bit more strict for now, and only allow this in service and scope units. Let's also add a generic accessor unit_cgroup_delegate() for checking whether a unit has delegation turned on that checks the new bool first. Also, when doing transient units, let's explicitly refuse turning on delegation for unit types that don't support it. This is mostly cosmetical as we wouldn't act on the delegation request anyway, but certainly helpful for debugging.	2018-02-12 11:34:00 +01:00
Lennart Poettering	cc6271f17d	core: turn on memory/cpu/tasks accounting by default for the root slice The kernel exposes the necessary data in /proc anyway, let's expose it hence by default. With this in place "systemctl status -- -.slice" will show accounting data out-of-the-box now.	2018-02-09 19:07:39 +01:00
Lennart Poettering	1f73aa0021	core: hook up /proc queries for the root slice, too Do what we already prepped in cgtop for the root slice in PID 1 too: consult /proc for the data we need.	2018-02-09 19:05:59 +01:00
Lennart Poettering	b734a4ff14	cgroup-util: rework cg_get_keyed_attribute() a bit Let's make sure we don't clobber the return parameter on failure, to follow our coding style. Also, break the loop early if we have all attributes we need. This also changes the keys parameter to a simple char**, so that we can use STRV_MAKE() for passing the list of attributes to read. This also makes it possible to distinguish the case when the whole attribute file doesn't exist from one key in it missing. In the former case we return -ENOENT, in the latter we now return -ENXIO.	2018-02-09 18:35:52 +01:00

340 commits