Systemd

Author	SHA1	Message	Date
Lennart Poettering	913c898ca0	cgroup: voidify a few things	2018-10-13 12:37:13 +02:00
Lennart Poettering	b9839ac9d9	cgroup: make sure whitelist_device() always returns a valid return value CID 1396094	2018-10-13 12:37:13 +02:00
Zbigniew Jędrzejewski-Szmek	f436470ae1	Merge pull request #10343 from poettering/manager-state-fix various fixes for PID1's Manager object	2018-10-10 12:36:16 +02:00
Lennart Poettering	638cece45d	core: clean up test run flags Let's make them typesafe, and let's add a nice macro helper for checking if we are in a test run, which should make testing for this much easier to read for most cases.	2018-10-09 19:43:43 +02:00
Roman Gushchin	084c700780	core: support cgroup v2 device controller Cgroup v2 provides the eBPF-based device controller, which isn't currently supported by systemd. This commit aims to provide such support. There are no user-visible changes, just the device policy and whitelist start working if cgroup v2 is used.	2018-10-09 09:47:51 -07:00
Roman Gushchin	17f149556a	core: refactor bpf firewall support into a pseudo-controller The idea is to introduce a concept of bpf-based pseudo-controllers to make adding new bpf-based features easier.	2018-10-09 09:46:08 -07:00
Tejun Heo	6ae4283cb1	core: add IODeviceLatencyTargetSec This adds support for the following proposed latency based IO control mechanism. https://lkml.org/lkml/2018/6/5/428	2018-08-22 16:46:18 +02:00
Yu Watanabe	fd870bac25	core: introduce cgroup_add_device_allow()	2018-08-06 13:42:14 +09:00
Tejun Heo	4842263577	core: add MemoryMin The kernel added support for a new cgroup memory controller knob memory.min in bf8d5d52ffe8 ("memcg: introduce memory.min") which was merged during v4.18 merge window. Add MemoryMin to support memory.min.	2018-07-12 08:21:43 +02:00
Yu Watanabe	b4dec49f83	core/cgroup: drop unnecessary condition	2018-06-25 13:09:48 +09:00
Lennart Poettering	0c69794138	tree-wide: remove Lennart's copyright lines These lines are generally out-of-date, incomplete and unnecessary. With SPDX and git repository much more accurate and fine grained information about licensing and authorship is available, hence let's drop the per-file copyright notice. Of course, removing copyright lines of others is problematic, hence this commit only removes my own lines and leaves all others untouched. It might be nicer if sooner or later those could go away too, making git the only and accurate source of authorship information.	2018-06-14 10:20:20 +02:00
Lennart Poettering	818bf54632	tree-wide: drop 'This file is part of systemd' blurb This part of the copyright blurb stems from the GPL use recommendations: https://www.gnu.org/licenses/gpl-howto.en.html The concept appears to originate in times where version control was per file, instead of per tree, and was a way to glue the files together. Ultimately, we nowadays don't live in that world anymore, and this information is entirely useless anyway, as people are very welcome to copy these files into any projects they like, and they shouldn't have to change bits that are part of our copyright header for that. hence, let's just get rid of this old cruft, and shorten our codebase a bit.	2018-06-14 10:20:20 +02:00
Lennart Poettering	17ae278097	core: when applying io/blkio per-device rules, don't remove them if they fail These devices might show up later, hence leave the rules as they are. Applying the limits should not alter configuration.	2018-06-12 22:52:36 +02:00
Zbigniew Jędrzejewski-Szmek	24d169e092	Merge pull request #9255 from poettering/block-dev-fixes some block device handling fixes	2018-06-12 12:53:37 +02:00
Zbigniew Jędrzejewski-Szmek	65be7e0652	pid1: do not reset subtree_control on already-existing units with delegation Fixes #8364. Reproducer: $ sudo systemd-run -t -p Delegate=yes bash # mkdir /sys/fs/cgroup/system.slice/run-u6958.service/supervisor # echo $$ > /sys/fs/cgroup/system.slice/run-u6958.service/supervisor/cgroup.procs # echo +memory > /sys/fs/cgroup/system.slice/run-u6958.service/cgroup.subtree_control # cat /sys/fs/cgroup/system.slice/run-u6958.service/cgroup.subtree_control memory # systemctl daemon-reload # cat /sys/fs/cgroup/system.slice/run-u6958.service/cgroup.subtree_control (empty) With patch, the last command shows 'memory'.	2018-06-11 18:12:30 +02:00
Lennart Poettering	45c2e06854	cgroup: beef up device lookup logic for block devices Let's chase block devices through btrfs and LUKS like we do elsewhere.	2018-06-11 18:01:06 +02:00
Lennart Poettering	19a691a9fd	cgroup: tiny log message tweak, say that we ignore one kind of failure	2018-06-05 22:04:39 +02:00
Zbigniew Jędrzejewski-Szmek	d94a24ca2e	Add macro for checking if some flags are set This way we don't need to repeat the argument twice. I didn't replace all instances. I think it's better to leave out: - asserts - comparisons like x & y == x, which are mathematically equivalent, but here we aren't checking if flags are set, but if the argument fits in the flags.	2018-06-04 11:50:44 +02:00
Yu Watanabe	858d36c1ec	path-util: introduce path_simplify() The function is similar to path_kill_slashes() but also removes initial './', trailing '/.', and '/./' in the path. When the second argument of path_simplify() is false, then it behaves the same as path_kill_slashes(). Hence, this also replaces path_kill_slashes() with path_simplify().	2018-06-03 23:39:26 +09:00
Zbigniew Jędrzejewski-Szmek	b1c05b98bf	tree-wide: avoid assignment of r just to use in a comparison This changes r = ...; if (r < 0) to if (... < 0) when r will not be used again.	2018-04-24 14:10:27 +02:00
Lennart Poettering	5d13a15b1d	tree-wide: drop spurious newlines (#8764) Double newlines (i.e. one empty line) are great to structure code. But let's avoid triple newlines (i.e. two empty lines), quadruple newlines, quintuple newlines, …, that's just spurious whitespace. It's an easy way to drop 121 lines of code, and keeps the coding style of our sources a bit tighter.	2018-04-19 12:13:23 +02:00
Lennart Poettering	57ea45e11a	util-lib: introduce new empty_or_root() helper (#8746) We check the same condition at various places. Let's add a trivial, common helper for this, and use it everywhere. It's not going to make things much faster or much shorter, but I think a lot more readable	2018-04-18 14:20:49 +02:00
Zbigniew Jędrzejewski-Szmek	11a1589223	tree-wide: drop license boilerplate Files which are installed as-is (any .service and other unit files, .conf files, .policy files, etc), are left as is. My assumption is that SPDX identifiers are not yet that well known, so it's better to retain the extended header to avoid any doubt. I also kept any copyright lines. We can probably remove them, but it'd be nice to obtain explicit acks from all involved authors before doing that.	2018-04-06 18:58:55 +02:00
Evgeny Vereshchagin	f6c63f6fc9	core: skip the removal of cgroups in the TEST_RUN_MINIMAL mode (#8622) When `systemd` is run in the TEST_RUN_MINIMAL mode, it doesn't really set up cgroups, so it shouldn't try to remove anything. Closes https://github.com/systemd/systemd/issues/8474.	2018-04-03 15:04:22 +02:00
Lennart Poettering	ae2a15bc14	macro: introduce TAKE_PTR() macro This macro will read a pointer of any type, return it, and set the pointer to NULL. This is useful as an explicit concept of passing ownership of a memory area between pointers. This takes inspiration from Rust: https://doc.rust-lang.org/std/option/enum.Option.html#method.take and was suggested by Alan Jenkins (@sourcejedi). It drops ~160 lines of code from our codebase, which makes me like it. Also, I think it clarifies passing of ownership, and thus helps readability a bit (at least for the initiated who know the new macro)	2018-03-22 20:21:42 +01:00
Michal Sekletar	aa77e234fc	core: ignore errors from cg_create_and_attach() in test mode (#8401) Reproducer: $ meson build && cd build $ ninja $ sudo useradd test $ sudo su test $ ./systemd --system --test ... Failed to create /user.slice/user-1000.slice/session-6.scope/init.scope control group: Permission denied Failed to allocate manager object: Permission denied Above error message is caused by the fact that user test didn't have its own session and we tried to set up init.scope already running as user test in the directory owned by different user. Let's try to set up cgroup hierarchy, but if that fails return error only when not running in the test mode. Fixes #8072	2018-03-09 23:30:32 +01:00
Lennart Poettering	902c8502ad	Merge pull request #8149 from poettering/fake-root-cgroup Properly synthesize CPU+memory accounting data for the root cgroup	2018-03-01 11:10:24 +01:00
Lennart Poettering	acf7f253de	bpf: use BPF_F_ALLOW_MULTI flag if it is available This new kernel 4.15 flag permits that multiple BPF programs can be executed for each packet processed: multiple per cgroup plus all programs defined up the tree on all parent cgroups. We can use this for two features: 1. Finally provide per-slice IP accounting (which was previously unavailable) 2. Permit delegation of BPF programs to services (i.e. leaf nodes). This patch beefs up PID1's handling of BPF to enable both. Note two special items to keep in mind: a. Our inner-node BPF programs (i.e. the ones we attach to slices) do not enforce IP access lists, that's done exclusively in the leaf-node BPF programs. That's a good thing, since that way rules in leaf nodes can cancel out rules further up (i.e. for example to implement a logic of "disallow everything except httpd.service"). Inner node BPF programs do accounting however if that's requested. This is beneficial for performance reasons: it means in order to provide per-slice IP accounting we don't have to add up all child unit's data. b. When this code is run on pre-4.15 kernel (i.e. where BPF_F_ALLOW_MULTI is not available) we'll make IP accounting on slice units unavailable (i.e. revert to behaviour from before this commit). For leaf nodes we'll fallback to non-ALLOW_MULTI mode however, which means that BPF delegation is not available there at all, if IP fw/acct is turned on for the unit. This is a change from earlier behaviour, where we use the BPF_F_ALLOW_OVERRIDE flag, so that our fw/acct would lose its effect as soon as delegation was turned on and some client made use of that. I think the new behaviour is the safer choice in this case, as silent bypassing of our fw rules is not possible anymore. And if people want proper delegation then the way out is a more modern kernel or turning off IP firewalling/acct for the unit altogether.	2018-02-21 16:43:36 +01:00
Lennart Poettering	6592b9759c	core: add new bus call for migrating foreign processes to scope/service units This adds a new bus call to service and scope units called AttachProcesses() that moves arbitrary processes into the cgroup of the unit. The primary user for this new API is systemd itself: the systemd --user instance uses this call of the systemd --system instance to migrate processes if itself gets the request to migrate processes and the kernel refuses this due to access restrictions. The primary use-case of this is to make "systemd-run --scope --user …" invoked from user session scopes work correctly on pure cgroupsv2 environments. There, the kernel refuses to migrate processes between two unprivileged-owned cgroups unless the requestor as well as the ownership of the closest parent cgroup all match. This however is not the case between the session-XYZ.scope unit of a login session and the user@ABC.service of the systemd --user instance. The new logic always tries to move the processes on its own, but if that doesn't work when being the user manager, then the system manager is asked to do it instead. The new operation is relatively restrictive: it will only allow to move the processes like this if the caller is root, or the UID of the target unit, caller and process all match. Note that this means that unprivileged users cannot attach processes to scope units, as those do not have "owning" users (i.e. they have no User= field). Fixes: #3388	2018-02-12 11:34:00 +01:00
Lennart Poettering	1d9cc8768f	cgroup: add a new "can_delegate" flag to the unit vtable, and set it for scope and service units only Currently we allowed delegation for all units with cgroup backing except for slices. Let's make this a bit more strict for now, and only allow this in service and scope units. Let's also add a generic accessor unit_cgroup_delegate() for checking whether a unit has delegation turned on that checks the new bool first. Also, when doing transient units, let's explicitly refuse turning on delegation for unit types that don't support it. This is mostly cosmetical as we wouldn't act on the delegation request anyway, but certainly helpful for debugging.	2018-02-12 11:34:00 +01:00
Lennart Poettering	cc6271f17d	core: turn on memory/cpu/tasks accounting by default for the root slice The kernel exposes the necessary data in /proc anyway, let's expose it hence by default. With this in place "systemctl status -- -.slice" will show accounting data out-of-the-box now.	2018-02-09 19:07:39 +01:00
Lennart Poettering	1f73aa0021	core: hook up /proc queries for the root slice, too Do what we already prepped in cgtop for the root slice in PID 1 too: consult /proc for the data we need.	2018-02-09 19:05:59 +01:00
Lennart Poettering	b734a4ff14	cgroup-util: rework cg_get_keyed_attribute() a bit Let's make sure we don't clobber the return parameter on failure, to follow our coding style. Also, break the loop early if we have all attributes we need. This also changes the keys parameter to a simple char**, so that we can use STRV_MAKE() for passing the list of attributes to read. This also makes it possible to distinguish the case when the whole attribute file doesn't exist from one key in it missing. In the former case we return -ENOENT, in the latter we now return -ENXIO.	2018-02-09 18:35:52 +01:00
Zbigniew Jędrzejewski-Szmek	f26f5b60d0	Merge pull request #7915 from poettering/pids-max-tweak	2018-01-25 10:24:35 +01:00
Lennart Poettering	62a769136d	core: rework how we track which PIDs to watch for a unit Previously, we'd maintain two hashmaps keyed by PIDs, pointing to Unit interested in SIGCHLD events for them. This scheme allowed a specific PID to be watched by exactly 0, 1 or 2 units. With this rework this is replaced by a single hashmap which is primarily keyed by the PID and points to a Unit interested in it. However, it is optionally also keyed by the negated PID, in which case it points to a NULL terminated array of additional Unit objects also interested. This scheme means arbitrary numbers of Units may now watch the same PID. Runtime and memory behaviour should not be impacted by this change, as for the common case (i.e. each PID only watched by a single unit) behaviour stays the same, but for the uncommon case (a PID watched by more than one unit) we only pay with a single additional memory allocation for the array. Why this all? Primarily, because allowing exactly two units to watch a specific PID is not sufficient for some niche cases, as processes can belong to more than one unit these days: 1. sd_notify() with MAINPID= can be used to attach a process from a different cgroup to multiple units. 2. Similarly, the PIDFile= setting in unit files can be used for similar setups. 3. By creating a scope unit a main process of a service may join a different unit, too. 4. On cgroupsv1 we frequently end up watching all processes remaining in a scope, and if a process opens lots of scopes one after the other it might thus end up being watched by many of them. This patch hence removes the 2-unit-per-PID limit. It also makes a couple of other changes, some of them quite relevant: - manager_get_unit_by_pid() (and the bus call wrapping it) when there's ambiguity will prefer returning the Unit the process belongs to based on cgroup membership, and only check the watch-pids hashmap if that fails. This change in logic is probably more in line with what people expect and makes things more stable as each process can belong to exactly one cgroup only. - Every SIGCHLD event is now dispatched to all units interested in its PID. Previously, there was some magic conditionalization: the SIGCHLD would only be dispatched to the unit if it was only interested in a single PID only, or the PID belonged to the control or main PID or we didn't dispatch a single SIGCHLD to the unit in the current event loop iteration yet. These rules were quite arbitrary and also redundant as the per-unit handlers would filter the PIDs anyway a second time. With this change we'll hence relax the rules: all we do now is dispatch every SIGCHLD event exactly once to each unit interested in it, and it's up to the unit to then use or ignore this. We use a generation counter in the unit to ensure that we only invoke the unit handler once for each event, protecting us from confusion if a unit is both associated with a specific PID through cgroup membership and through the "watch_pids" logic. It also protects us from being confused if the "watch_pids" hashmap is altered while we are dispatching to it (which is a very likely case). - sd_notify() message dispatching has been reworked to be very similar to SIGCHLD handling now. A generation counter is used for dispatching as well. This also adds a new test that validates that "watch_pid" registration and unregistration works correctly.	2018-01-23 21:29:31 +01:00
Lennart Poettering	11aef522c1	core: unify call we use to synthesize cgroup empty events when we stopped watching any unit PIDs This code is very similar in scope and service units, let's unify it in one function. This changes little for service units, but for scope units makes sure we go through the cgroup queue, which is something we should do anyway.	2018-01-23 21:22:50 +01:00
Lennart Poettering	2ca9d97943	core: fix manager_get_unit_by_pid() special casing of manager PID Previously, we'd hard map PID 1 to the manager scope unit. That's wrong however when we are run in --user mode, as the PID 1 is outside of the subtree we manage and the manager PID might be very different. Correct that by checking for getpid() rather than hardcoding 1.	2018-01-23 21:22:50 +01:00
Lennart Poettering	00b5974f70	core: propagate TasksMax= on the root slice to sysctls The cgroup "pids" controller is not supported on the root cgroup. However we expose TasksMax= on it, but currently don't actually apply it to anything. Let's correct this: if set, let's propagate things to the right sysctls. This way we can expose TasksMax= on all units in a somewhat sensible way.	2018-01-22 16:26:55 +01:00
Lennart Poettering	c36a69f4cd	cgroup: when querying the number of tasks in the root slice use the pid_max sysctl The root cgroup doesn't expose any properties in the "pids" cgroup controller, hence we need to get the data from somewhere else.	2018-01-22 16:26:55 +01:00
Lennart Poettering	f3725e64fe	cgroup: add proper API to determine whether our unit manages the root cgroup	2018-01-22 16:26:55 +01:00
Lennart Poettering	8793fa2565	cgroup: use CGROUP_LIMIT_MAX where appropriate	2018-01-22 16:22:03 +01:00
Alan Jenkins	5a7f87a9e0	core: un-break PrivateDevices= by allowing it to mknod /dev/ptmx #7886 caused PrivateDevices= to silently fail-open. https://github.com/systemd/systemd/pull/7886#issuecomment-358542849 Allow PrivateDevices= to succeed, in creating /dev/ptmx, even though DeviceControl=closed applies. No specific justification was given for blocking mknod of /dev/ptmx. Only that we didn't seem to need it, because we weren't creating it correctly as a device node.	2018-01-18 12:10:20 +00:00
Lennart Poettering	18c528e99f	basic: split out blockdev-util.[ch] from util.h With three functions it makes sense to split this out now.	2017-12-25 11:48:21 +01:00
Lennart Poettering	a4634b214c	core: warn about left-over processes in cgroup on unit start Now that we don't kill control processes anymore, let's at least warn about any processes left-over in the unit cgroup at the moment of starting the unit.	2017-11-25 17:08:21 +01:00
Lennart Poettering	60c728adf7	unit: initialize bpf cgroup realization state properly Before this patch, the bpf cgroup realization state was implicitly set to "NO", meaning that the bpf configuration was realized but was turned off. That means invalidation requests for the bpf stuff (which we issue in blanket fashion when doing a daemon reload) would actually later result in us re-realizing the unit, under the assumption it was already realized once, even though in reality it never was realized before. This had the effect that after each daemon-reload we'd end up realizing all defined units, even the unloaded ones, populating cgroupfs with lots of unneeded empty cgroups. With this fix we properly set the realization state to "INVALIDATED", i.e. indicating the bpf stuff was never set up for the unit, and hence when we try to invalidate it later we won't do anything.	2017-11-25 17:08:21 +01:00
Lennart Poettering	2aa57a6550	cgroup: when dispatching the cgroup realization queue, check again if we shall actually realize We add units to the cgroup realization queue when propagating realizing requests to sibling units, and when invalidating cgroup settings because some cgroup setting changed. In the time between where we add the unit to the queue until the cgroup is actually dispatched the unit's state might have changed however, so that the unit doesn't actually need to be realized anymore, for example because the unit went down. To handle that, check the unit state again, if realization makes sense. Redundant realization is usually not a problem, except when the unit is not actually running, hence check exactly for that.	2017-11-25 17:08:21 +01:00
Lennart Poettering	0f2d84d2cc	cgroup: drop unused parameter from function	2017-11-25 17:08:21 +01:00
Evgeny Vereshchagin	0fb8449930	cgroup: downgrade the log level of "invocation id" messages to debug (#7422) Now that `d3070fbdf6` has been merged, these errors are not as critical as they used to be.	2017-11-23 11:07:20 +01:00
Lennart Poettering	64e844e5ca	cgroup: fix delegation on the unified hierarchy Make sure to add the delegation mask to the mask of controllers we have to enable on our own unit. Do not claim it was a members mask, as such a logic would mean we'd collide with cgroupv2's "no processes on inner nodes policy". This change does the right thing: it means any controller enabled through Controllers= will be made available to subcgroups of our unit, but the unit itself has to still enable it through cgroup.subtree_control (which it can since that file is delegated too) to be inherited further down. Or to say this differently: we only should manipulate cgroup.subtree_control ourselves for inner nodes (i.e. slices), and for leaves we need to provide a way to enable controllers in the slices above, but stay away from the cgroup's own cgroup.subtree_control, which is what this patch ensures. Fixes: #7355	2017-11-21 11:54:08 +01:00
Zbigniew Jędrzejewski-Szmek	53e1b68390	Add SPDX license identifiers to source files under the LGPL This follows what the kernel is doing, c.f. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5fd54ace4721fc5ce2bb5aef6318fcf17f421460.	2017-11-19 19:08:15 +01:00

223 commits