Commit Graph

199 Commits

Author SHA1 Message Date
Lennart Poettering ae2a15bc14 macro: introduce TAKE_PTR() macro
This macro will read a pointer of any type, return it, and set the
pointer to NULL. This is useful as an explicit concept of passing
ownership of a memory area between pointers.

This takes inspiration from Rust:

https://doc.rust-lang.org/std/option/enum.Option.html#method.take

and was suggested by Alan Jenkins (@sourcejedi).

It drops ~160 lines of code from our codebase, which makes me like it.
Also, I think it clarifies passing of ownership, and thus helps
readability a bit (at least for the initiated who know the new macro)
2018-03-22 20:21:42 +01:00
Michal Sekletar aa77e234fc core: ignore errors from cg_create_and_attach() in test mode (#8401)
Reproducer:

$ meson build && cd build
$ ninja
$ sudo useradd test
$ sudo su test
$ ./systemd --system --test
...
Failed to create /user.slice/user-1000.slice/session-6.scope/init.scope control group: Permission denied
Failed to allocate manager object: Permission denied

Above error message is caused by the fact that user test didn't have its
own session and we tried to set up init.scope already running as user
test in the directory owned by different user.

Let's try to setup cgroup hierarchy, but if that fails return error only
when not running in the test mode.

Fixes #8072
2018-03-09 23:30:32 +01:00
Lennart Poettering 902c8502ad
Merge pull request #8149 from poettering/fake-root-cgroup
Properly synthesize CPU+memory accounting data for the root cgroup
2018-03-01 11:10:24 +01:00
Lennart Poettering acf7f253de bpf: use BPF_F_ALLOW_MULTI flag if it is available
This new kernel 4.15 flag permits that multiple BPF programs can be
executed for each packet processed: multiple per cgroup plus all
programs defined up the tree on all parent cgroups.

We can use this for two features:

1. Finally provide per-slice IP accounting (which was previously
   unavailable)

2. Permit delegation of BPF programs to services (i.e. leaf nodes).

This patch beefs up PID1's handling of BPF to enable both.

Note two special items to keep in mind:

a. Our inner-node BPF programs (i.e. the ones we attach to slices) do
   not enforce IP access lists, that's done exclsuively in the leaf-node
   BPF programs. That's a good thing, since that way rules in leaf nodes
   can cancel out rules further up (i.e. for example to implement a
   logic of "disallow everything except httpd.service"). Inner node BPF
   programs to accounting however if that's requested. This is
   beneficial for performance reasons: it means in order to provide
   per-slice IP accounting we don't have to add up all child unit's
   data.

b. When this code is run on pre-4.15 kernel (i.e. where
   BPF_F_ALLOW_MULTI is not available) we'll make IP acocunting on slice
   units unavailable (i.e. revert to behaviour from before this commit).
   For leaf nodes we'll fallback to non-ALLOW_MULTI mode however, which
   means that BPF delegation is not available there at all, if IP
   fw/acct is turned on for the unit. This is a change from earlier
   behaviour, where we use the BPF_F_ALLOW_OVERRIDE flag, so that our
   fw/acct would lose its effect as soon as delegation was turned on and
   some client made use of that. I think the new behaviour is the safer
   choice in this case, as silent bypassing of our fw rules is not
   possible anymore. And if people want proper delegation then the way
   out is a more modern kernel or turning off IP firewalling/acct for
   the unit algother.
2018-02-21 16:43:36 +01:00
Lennart Poettering 6592b9759c core: add new new bus call for migrating foreign processes to scope/service units
This adds a new bus call to service and scope units called
AttachProcesses() that moves arbitrary processes into the cgroup of the
unit. The primary user for this new API is systemd itself: the systemd
--user instance uses this call of the systemd --system instance to
migrate processes if itself gets the request to migrate processes and
the kernel refuses this due to access restrictions.

The primary use-case of this is to make "systemd-run --scope --user …"
invoked from user session scopes work correctly on pure cgroupsv2
environments. There, the kernel refuses to migrate processes between two
unprivileged-owned cgroups unless the requestor as well as the ownership
of the closest parent cgroup all match. This however is not the case
between the session-XYZ.scope unit of a login session and the
user@ABC.service of the systemd --user instance.

The new logic always tries to move the processes on its own, but if
that doesn't work when being the user manager, then the system manager
is asked to do it instead.

The new operation is relatively restrictive: it will only allow to move
the processes like this if the caller is root, or the UID of the target
unit, caller and process all match. Note that this means that
unprivileged users cannot attach processes to scope units, as those do
not have "owning" users (i.e. they have now User= field).

Fixes: #3388
2018-02-12 11:34:00 +01:00
Lennart Poettering 1d9cc8768f cgroup: add a new "can_delegate" flag to the unit vtable, and set it for scope and service units only
Currently we allowed delegation for alluntis with cgroup backing
except for slices. Let's make this a bit more strict for now, and only
allow this in service and scope units.

Let's also add a generic accessor unit_cgroup_delegate() for checking
whether a unit has delegation turned on that checks the new bool first.

Also, when doing transient units, let's explcitly refuse turning on
delegation for unit types that don#t support it. This is mostly
cosmetical as we wouldn't act on the delegation request anyway, but
certainly helpful for debugging.
2018-02-12 11:34:00 +01:00
Lennart Poettering cc6271f17d core: turn on memory/cpu/tasks accounting by default for the root slice
The kernel exposes the necessary data in /proc anyway, let's expose it
hence by default.

With this in place "systemctl status -- -.slice" will show accounting
data out-of-the-box now.
2018-02-09 19:07:39 +01:00
Lennart Poettering 1f73aa0021 core: hook up /proc queries for the root slice, too
Do what we already prepped in cgtop for the root slice in PID 1 too:
consult /proc for the data we need.
2018-02-09 19:05:59 +01:00
Lennart Poettering b734a4ff14 cgroup-util: rework cg_get_keyed_attribute() a bit
Let's make sure we don't clobber the return parameter on failure, to
follow our coding style. Also, break the loop early if we have all
attributes we need.

This also changes the keys parameter to a simple char**, so that we can
use STRV_MAKE() for passing the list of attributes to read.

This also makes it possible to distuingish the case when the whole
attribute file doesn't exist from one key in it missing. In the former
case we return -ENOENT, in the latter we now return -ENXIO.
2018-02-09 18:35:52 +01:00
Zbigniew Jędrzejewski-Szmek f26f5b60d0 Merge pull request #7915 from poettering/pids-max-tweak 2018-01-25 10:24:35 +01:00
Lennart Poettering 62a769136d core: rework how we track which PIDs to watch for a unit
Previously, we'd maintain two hashmaps keyed by PIDs, pointing to Unit
interested in SIGCHLD events for them. This scheme allowed a specific
PID to be watched by exactly 0, 1 or 2 units.

With this rework this is replaced by a single hashmap which is primarily
keyed by the PID and points to a Unit interested in it. However, it
optionally also keyed by the negated PID, in which case it points to a
NULL terminated array of additional Unit objects also interested. This
scheme means arbitrary numbers of Units may now watch the same PID.

Runtime and memory behaviour should not be impact by this change, as for
the common case (i.e. each PID only watched by a single unit) behaviour
stays the same, but for the uncommon case (a PID watched by more than
one unit) we only pay with a single additional memory allocation for the
array.

Why this all? Primarily, because allowing exactly two units to watch a
specific PID is not sufficient for some niche cases, as processes can
belong to more than one unit these days:

1. sd_notify() with MAINPID= can be used to attach a process from a
   different cgroup to multiple units.

2. Similar, the PIDFile= setting in unit files can be used for similar
   setups,

3. By creating a scope unit a main process of a service may join a
   different unit, too.

4. On cgroupsv1 we frequently end up watching all processes remaining in
   a scope, and if a process opens lots of scopes one after the other it
   might thus end up being watch by many of them.

This patch hence removes the 2-unit-per-PID limit. It also makes a
couple of other changes, some of them quite relevant:

- manager_get_unit_by_pid() (and the bus call wrapping it) when there's
  ambiguity will prefer returning the Unit the process belongs to based on
  cgroup membership, and only check the watch-pids hashmap if that
  fails. This change in logic is probably more in line with what people
  expect and makes things more stable as each process can belong to
  exactly one cgroup only.

- Every SIGCHLD event is now dispatched to all units interested in its
  PID. Previously, there was some magic conditionalization: the SIGCHLD
  would only be dispatched to the unit if it was only interested in a
  single PID only, or the PID belonged to the control or main PID or we
  didn't dispatch a signle SIGCHLD to the unit in the current event loop
  iteration yet. These rules were quite arbitrary and also redundant as
  the the per-unit handlers would filter the PIDs anyway a second time.
  With this change we'll hence relax the rules: all we do now is
  dispatch every SIGCHLD event exactly once to each unit interested in
  it, and it's up to the unit to then use or ignore this. We use a
  generation counter in the unit to ensure that we only invoke the unit
  handler once for each event, protecting us from confusion if a unit is
  both associated with a specific PID through cgroup membership and
  through the "watch_pids" logic. It also protects us from being
  confused if the "watch_pids" hashmap is altered while we are
  dispatching to it (which is a very likely case).

- sd_notify() message dispatching has been reworked to be very similar
  to SIGCHLD handling now. A generation counter is used for dispatching
  as well.

This also adds a new test that validates that "watch_pid" registration
and unregstration works correctly.
2018-01-23 21:29:31 +01:00
Lennart Poettering 11aef522c1 core: unify call we use to synthesize cgroup empty events when we stopped watching any unit PIDs
This code is very similar in scope and service units, let's unify it in
one function. This changes little for service units, but for scope units
makes sure we go through the cgroup queue, which is something we should
do anyway.
2018-01-23 21:22:50 +01:00
Lennart Poettering 2ca9d97943 core: fix manager_get_unit_by_pid() special casing of manager PID
Previously, we'd hard map PID 1 to the manager scope unit. That's wrong
however when we are run in --user mode, as the PID 1 is outside of the
subtree we manage and the manager PID might be very differently. Correct
that by checking for getpid() rather than hardcoding 1.
2018-01-23 21:22:50 +01:00
Lennart Poettering 00b5974f70 core: propagate TasksMax= on the root slice to sysctls
The cgroup "pids" controller is not supported on the root cgroup.
However we expose TasksMax= on it, but currently don't actually apply it
to anything. Let's correct this: if set, let's propagate things to the
right sysctls.

This way we can expose TasksMax= on all units in a somewhat sensible
way.
2018-01-22 16:26:55 +01:00
Lennart Poettering c36a69f4cd cgroup: when querying the number of tasks in the root slice use the pid_max sysctl
The root cgroup doesn't expose and properties in the "pids" cgroup
controller, hence we need to get the data from somewhere else.
2018-01-22 16:26:55 +01:00
Lennart Poettering f3725e64fe cgroup: add proper API to determine whether our unit manags to root cgroup 2018-01-22 16:26:55 +01:00
Lennart Poettering 8793fa2565 cgroup: use CGROUP_LIMIT_MAX where appropriate 2018-01-22 16:22:03 +01:00
Alan Jenkins 5a7f87a9e0 core: un-break PrivateDevices= by allowing it to mknod /dev/ptmx
#7886 caused PrivateDevices= to silently fail-open.
https://github.com/systemd/systemd/pull/7886#issuecomment-358542849

Allow PrivateDevices= to succeed, in creating /dev/ptmx, even though
DeviceControl=closed applies.

No specific justification was given for blocking mknod of /dev/ptmx.  Only
that we didn't seem to need it, because we weren't creating it correctly as
a device node.
2018-01-18 12:10:20 +00:00
Lennart Poettering 18c528e99f basic: split out blockdev-util.[ch] from util.h
With three functions it makes sense to split this out now.
2017-12-25 11:48:21 +01:00
Lennart Poettering a4634b214c core: warn about left-over processes in cgroup on unit start
Now that we don't kill control processes anymore, let's at least warn
about any processes left-over in the unit cgroup at the moment of
starting the unit.
2017-11-25 17:08:21 +01:00
Lennart Poettering 60c728adf7 unit: initialize bpf cgroup realization state properly
Before this patch, the bpf cgroup realization state was implicitly set
to "NO", meaning that the bpf configuration was realized but was turned
off. That means invalidation requests for the bpf stuff (which we issue
in blanket fashion when doing a daemon reload) would actually later
result in a us re-realizing the unit, under the assumption it was
already realized once, even though in reality it never was realized
before.

This had the effect that after each daemon-reload we'd end up realizing
*all* defined units, even the unloaded ones, populating cgroupfs with
lots of unneeded empty cgroups.

With this fix we properly set the realiazation state to "INVALIDATED",
i.e. indicating the bpf stuff was never set up for the unit, and hence
when we try to invalidate it later we won't do anything.
2017-11-25 17:08:21 +01:00
Lennart Poettering 2aa57a6550 cgroup: when dispatching the cgroup realization queue, check again if we shall actually realize
We add units to the cgroup realization queue when propagating realizing
requests to sibling units, and when invalidating cgroup settings because
some cgroup setting changed. In the time between where we add the unit
to the queue until the cgroup is actually dispatched the unit's state
might have changed however, so that the unit doesn't actually need to be
realized anymore, for example because the unit went down. To handle
that, check the unit state again, if realization makes sense.

Redundant realization is usually not a problem, except when the unit is
not actually running, hence check exactly for that.
2017-11-25 17:08:21 +01:00
Lennart Poettering 0f2d84d2cc cgroup: drop unused parameter from function 2017-11-25 17:08:21 +01:00
Evgeny Vereshchagin 0fb8449930 cgroup: downgrade the log level of "invocation id" messages to debug (#7422)
Now that d3070fbdf6 has been merged, these errors are not as
critical as they used to be.
2017-11-23 11:07:20 +01:00
Lennart Poettering 64e844e5ca cgroup: fix delegation on the unified hierarchy
Make sure to add the delegation mask to the mask of controllers we have
to enable on our own unit. Do not claim it was a members mask, as such
a logic would mean we'd collide with cgroupv2's "no processes on inner
nodes policy".

This change does the right thing: it means any controller enabled
through Controllers= will be made available to subcrgoups of our unit,
but the unit itself has to still enable it through
cgroup.subtree_control (which it can since that file is delegated too)
to be inherited further down.

Or to say this differently: we only should manipulate
cgroup.subtree_control ourselves for inner nodes (i.e. slices), and
for leaves we need to provide a way to enable controllers in the slices
above, but stay away from the cgroup's own cgroup.subtree_control —
which is what this patch ensures.

Fixes: #7355
2017-11-21 11:54:08 +01:00
Zbigniew Jędrzejewski-Szmek 53e1b68390 Add SPDX license identifiers to source files under the LGPL
This follows what the kernel is doing, c.f.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5fd54ace4721fc5ce2bb5aef6318fcf17f421460.
2017-11-19 19:08:15 +01:00
Zbigniew Jędrzejewski-Szmek d4c819ed23 core: fix message about detected memory hierarchy
Just the error check and message were wrong, otherwise the logic was OK.
2017-11-15 22:58:24 +01:00
Zbigniew Jędrzejewski-Szmek 47a78d4102 Use plural DelegateControllers= consistently 2017-11-15 22:58:24 +01:00
Lennart Poettering 0263828039 core: rework the Delegate= unit file setting to take a list of controller names
Previously it was not possible to select which controllers to enable for
a unit where Delegate=yes was set, as all controllers were enabled. With
this change, this is made configurable, and thus delegation units can
pick specifically what they want to manage themselves, and what they
don't care about.
2017-11-13 10:49:15 +01:00
Lennart Poettering 7546145e26 string-util: add delete_trailing_chars() and skip_leading_chars() helpers
And let's port over a couple of users to the new APIs.
2017-11-13 10:47:15 +01:00
Lennart Poettering 3160497017 cgroup: make use of unit_get_subtree_mask() where appropriate
subtree_mask is own_mask | members_mask, let's make use of that to
shorten a few things
2017-11-13 10:24:03 +01:00
Lennart Poettering eef85c4a3f core: track why unit dependencies came to be
This replaces the dependencies Set* objects by Hashmap* objects, where
the key is the depending Unit, and the value is a bitmask encoding why
the specific dependency was created.

The bitmask contains a number of different, defined bits, that indicate
why dependencies exist, for example whether they are created due to
explicitly configured deps in files, by udev rules or implicitly.

Note that memory usage is not increased by this change, even though we
store more information, as we manage to encode the bit mask inside the
value pointer each Hashmap entry contains.

Why this all? When we know how a dependency came to be, we can update
dependencies correctly when a configuration source changes but others
are left unaltered. Specifically:

1. We can fix UDEV_WANTS dependency generation: so far we kept adding
   dependencies configured that way, but if a device lost such a
   dependency we couldn't them again as there was no scheme for removing
   of dependencies in place.

2. We can implement "pin-pointed" reload of unit files. If we know what
   dependencies were created as result of configuration in a unit file,
   then we know what to flush out when we want to reload it.

3. It's useful for debugging: "systemd-analyze dump" now shows
   this information, helping substantially with understanding how
   systemd's dependency tree came to be the way it came to be.
2017-11-10 19:45:29 +01:00
Yu Watanabe 4c70109600 tree-wide: use IN_SET macro (#6977) 2017-10-04 16:01:32 +02:00
Lennart Poettering 4724964040 cgroup: IN_SET() FTW! 2017-09-27 18:26:18 +02:00
Lennart Poettering 09e2465407 cgroup: after determining that a cgroup is empty, asynchronously dispatch this
This makes sure that if we learn via inotify or another event source
that a cgroup is empty, and we checked that this is indeed the case (as
we might get spurious notifications through inotify, as the inotify
logic through the "cgroups.event" is pretty unspecific and might be
trigger for a variety of reasons), then we'll enqueue a defer event for
it, at a priority lower than SIGCHLD handling, so that we know for sure
that if there's waitid() data for a process we used it before
considering the cgroup empty notification.

Fixes: #6608
2017-09-27 18:26:18 +02:00
Lennart Poettering 91a6073ef7 core: rename cgroup_queue → cgroup_realize_queue
We are about to add second cgroup-related queue, called
"cgroup_empty_queue", hence let's rename "cgroup_queue" to
"cgroup_realize_queue" (as that is its purpose) to minimize confusion
about the two queues.

Just a rename, no functional changes.
2017-09-27 17:59:25 +02:00
Zbigniew Jędrzejewski-Szmek 2e4025c0f9 core/cgroup: add a helper macro for a common pattern (#6926) 2017-09-27 17:54:06 +02:00
Lennart Poettering cf3b4be101 cgroup: refuse to return accounting data if accounting isn't turned on
We used to be a bit sloppy on this, and handed out accounting data even
for units where accounting wasn't explicitly enabled. Let's be stricter
here, so that we know the accounting data is actually fully valid. This
is necessary, as the accounting data is no longer stored exclusively in
cgroupfs, but is partly maintained external of that, and flushed during
unit starts. We should hence only expose accounting data we really know
is fully current.
2017-09-22 15:24:55 +02:00
Lennart Poettering 58d83430e1 core: when coming back from reload/reexec, reapply all cgroup properties
With this change we'll invalidate all cgroup settings after coming back
from a daemon reload/reexec, so that the new settings are instantly
applied.

This is useful for the BPF case, because we don't serialize/deserialize
the BPF program fd, and hence have to install a new, updated BPF program
when coming back from the reload/reexec. However, this is also useful
for the rest of the cgroup settings, as it ensures that user
configuration really takes effect wherever we can.
2017-09-22 15:24:55 +02:00
Lennart Poettering 6b659ed87e core: serialize/deserialize IP accounting across daemon reload/reexec
Make sure the current IP accounting counters aren't lost during
reload/reexec.

Note that we destroy all BPF file objects during a reload: the BPF
programs, the access and the accounting maps. The former two need to be
regenerated anyway with the newly loaded configuration data, but the
latter one needs to survive reloads/reexec. In this implementation I
opted to only save/restore the accounting map content instead of the map
itself. While this opens a (theoretic) window where IP traffic is still
accounted to the old map after we read it out, and we thus miss a few
bytes this has the benefit that we can alter the map layout between
versions should the need arise.
2017-09-22 15:24:55 +02:00
Lennart Poettering c21c99060b cgroup: dump the newly added IP settings in the cgroup context 2017-09-22 15:24:55 +02:00
Daniel Mack 906c06f64a cgroup, unit, fragment parser: make use of new firewall functions 2017-09-22 15:24:55 +02:00
Daniel Mack 6a48d82f02 cgroup: add fields to accommodate eBPF related details
Add pointers for compiled eBPF programs as well as list heads for allowed
and denied hosts for both directions.
2017-09-22 15:24:54 +02:00
Lennart Poettering 10bd3e2e4c manager: watching the cgroup2 inotify fd is safe in test runs too
Less deviation between test runs and normal runs is always a good idea,
hence enable more stuff that is safe in test runs
2017-09-22 15:24:54 +02:00
Lennart Poettering 7cce4fb7f7 cgroup: always invalidate "cpu" and "cpuacct" together
This doesn't really matter, as we never invalidate cpuacct explicitly,
and there's no real reason to care for it explicitly, however it's
prettier if we always treat cpu and cpuacct as belonging together, the
same way we conisder "io" and "blkio" to belong together.
2017-09-22 15:24:54 +02:00
Zbigniew Jędrzejewski-Szmek e0a3da1fd2 Make test_run into a flags field and disable generators again
Now generators are only run in systemd --test mode, where this makes
most sense (how are you going to test what would happen otherwise?).

Fixes #6842.

v2:
- rename test_run to test_run_flags
2017-09-19 20:14:05 +02:00
Lennart Poettering 27458ed629 tree-wide: use path_startswith() rather than startswith() where ever that's appropriate
When checking path prefixes we really should use the right APIs, just in
case people add multiple slashes to their paths...
2017-08-09 19:03:39 +02:00
Zbigniew Jędrzejewski-Szmek a132bef023 Drop kdbus bits
Some kdbus_flag and memfd related parts are left behind, because they
are entangled with the "legacy" dbus support.

test-bus-benchmark is switched to "manual". It was already broken before
(in the non-kdbus mode) but apparently nobody noticed. Hopefully it can
be fixed later.
2017-07-23 12:01:54 -04:00
Lennart Poettering df0ff12775 tree-wide: make use of getpid_cached() wherever we can
This moves pretty much all uses of getpid() over to getpid_raw(). I
didn't specifically check whether the optimization is worth it for each
replacement, but in order to keep things simple and systematic I
switched over everything at once.
2017-07-20 20:27:24 +02:00
Zbigniew Jędrzejewski-Szmek 25f027c5ef tree-wide: when %m is used in log_*, always specify errno explicitly
All those uses were correct, but I think it's better to be explicit.
Using implicit errno is too error prone, and with this change we can require
(in the sense of a style guideline) that the code is always specified.

Helpful query: git grep -n -P 'log_[^s][a-z]+\(.*%m'
2017-05-19 14:24:03 -04:00