This adds a new bus call to service and scope units called
AttachProcesses() that moves arbitrary processes into the cgroup of the
unit. The primary user for this new API is systemd itself: the systemd
--user instance uses this call of the systemd --system instance to
migrate processes if itself gets the request to migrate processes and
the kernel refuses this due to access restrictions.
The primary use-case of this is to make "systemd-run --scope --user …"
invoked from user session scopes work correctly on pure cgroupsv2
environments. There, the kernel refuses to migrate processes between two
unprivileged-owned cgroups unless the requestor as well as the ownership
of the closest parent cgroup all match. This however is not the case
between the session-XYZ.scope unit of a login session and the
user@ABC.service of the systemd --user instance.
The new logic always tries to move the processes on its own, but if
that doesn't work when being the user manager, then the system manager
is asked to do it instead.
The new operation is relatively restrictive: it will only allow to move
the processes like this if the caller is root, or the UID of the target
unit, caller and process all match. Note that this means that
unprivileged users cannot attach processes to scope units, as those do
not have "owning" users (i.e. they have now User= field).
Fixes: #3388
This corrects the control flow: when we reuse the API bus as system bus,
let's definitely invoke bus_init_system() too, so that it is called
regardless how we acquired the bus object.
(Note that this doesn't actually change anything, as we only inherit the
bus like this in system mode, and bus_init_system() doesn't do anything
in system bus, besides writing a log message)
Let's better be safe than sorry, and validate that api_bus is not NULL
before we send messages to it. Of course, strictly speaking this
shouldn't actually be necessary, as the tracker object should not exist
without the bus, but let's be extra sure.
Previously, we'd synchronize bus names immediately when we succeeded
connecting to the bus, potentially even before coldplugging the units.
This was problematic, as synchronizing bus names meant invoking the
per-unit name change handler function which might change the unit's
state — which will result in consistency when done before we coldplug
things.
With this change we instead enqueue a job for the event loop to resync
the names in a later loop iteration, i.e. at a point where we know
coldplugging has finished.
This splits out the code that translates a unit name into a Unit* object
from method_get_unit(), and reuses it all other functions that operate
similar to it. This effectively means all those calls now optionally
take an empty unit string which now means the same as the client's unit.
This useful behaviour of the GetUnit() bus call is thus extended to all
other matching bus calls.
Similar, the same logic from method_load_unit() is also generalized and
reused wherever appropriate.
This patch does four things:
1. Adds more comments that clarify the order in which things appear in
the file
2. All entries are placed in the order in which their SD_BUS_METHOD()
macros appear in the C vtables.
3. A couple of missing entries are added that should be open to all or
do polkit
4. Corrects the interface name for the GetProcesses() calls. They belong
to the per-unit interface, not to Unit
Let's also use the journal if it is currently reloading. In that state
it should also be able to process our requests. Moreover, we might
otherwise end up disconnecting/reconnecting from the journal without
really any need to hence, relax the check accordingly.
This removes the current bus_init() call, as it had multiple problems:
it munged handling of the three bus connections we care about (private,
"api" and system) into one, even though the conditions when which was
ready are very different. It also added redundant logging, as the
individual calls it called all logged on their own anyway.
The three calls bus_init_api(), bus_init_private() and bus_init_system()
are now made public. A new call manager_dbus_is_running() is added that
works much like manager_journal_is_running() and is a lot more careful
when checking whether dbus is around. Optionally it checks the unit's
deserialized_state rather than state, in order to accomodate for cases
where we cant to connect to the bus before deserializing the
"subscribed" list, before coldplugging the units.
manager_recheck_dbus() is added, that works a lot like
manager_recheck_journal() and is invoked in unit_notify(), i.e. when
units change state.
All in all this should make handling a bit more alike to journal
handling, and it also fixes one major bug: when running in user mode
we'll now connect to the system bus early on, without conditionalizing
this in anyway.
I figure saying "systemd" here was a typo, and it should have been
"system". (Yes, it becomes very hard after a while typing "system"
correctly if you type "systemd" so often.) That said, "systemd" in some
ways is actually more correct, since BPF might be available for the
system instance but not in the user instance.
Either way, talking of "this systemd" is weird, let's reword this to be
"this manager", to emphasize that it's the local instance of systemd
where BPF is not available, but that it might be available otherwise.
This reworks is_kernel_thread() a bit. Instead of checking whether
/proc/$pid/cmdline is entirely empty we now parse the 'flags' field from
/proc/$pid/stat and check the PF_KTHREAD flag, which directly encodes
whether something is a kernel thread.
Why all this? With current kernels userspace processes can set their
command line to empty too (through PR_SET_MM_ARG_START and friends), and
could potentially confuse us. Hence, let's use a more reliable way to
detect kernels like this.
Let's simplify things a bit: we so far called both functions every
single time, let's just merge one into the other, so that we have fewer
functions to call.
Currently we allowed delegation for alluntis with cgroup backing
except for slices. Let's make this a bit more strict for now, and only
allow this in service and scope units.
Let's also add a generic accessor unit_cgroup_delegate() for checking
whether a unit has delegation turned on that checks the new bool first.
Also, when doing transient units, let's explcitly refuse turning on
delegation for unit types that don#t support it. This is mostly
cosmetical as we wouldn't act on the delegation request anyway, but
certainly helpful for debugging.
This adds some paranoia code that moves some of the fds we allocate for
longer periods of times to fds > 2 if they are allocated below this
boundary. This is a paranoid safety thing, in order to avoid that
external code might end up erroneously use our fds under the assumption
they were valid stdin/stdout/stderr. Think: some app closes
stdin/stdout/stderr and then invokes 'fprintf(stderr, …' which causes
writes on our fds.
This both adds the helper to do the moving as well as ports over a
number of users to this new logic. Since we don't want to litter all our
code with invocations of this I tried to strictly focus on fds we keep
open for long periods of times only and only in code that is frequently
loaded into foreign programs (under the assumptions that in our own
codebase we are smart enough to always keep stdin/stdout/stderr
allocated to avoid this pitfall). Specifically this means all code used
by NSS and our sd-xyz API:
1. our logging APIs
2. sd-event
3. sd-bus
4. sd-resolve
5. sd-netlink
This changed was inspired by this:
https://github.com/systemd/systemd/issues/8075#issuecomment-363689755
This shows that apparently IRL there are programs that do close
stdin/stdout/stderr, and we should accomodate for that.
Note that this won't fix any bugs, this just makes sure that buggy
programs are less likely to interfere with out own code.
This change adds support for controlling the suspend-on-lid-close
behaviour based on the power status as well as whether the machine is
docked or has an external monitor. For backwards compatibility the new
configuration file variable is ignored completely by default, and must
be set explicitly before being considered in any decisions.
Let's read the PID file after all if there's a potentially unsafe
symlink chain in place. But if we do, then refuse taking the PID if its
outside of the cgroup.
Fixes: #8085
So far we didn't document control, transient, dbus config, or generator paths.
But those paths are visible to users, and they need to understand why systemd
loads units from those paths, and how the precedence hierarchy looks.
The whole thing is a bit messy, since the list of paths is quite long.
I made the tables a bit shorter by combining rows for the alternatives
where $XDG_* is set and the fallback.
In various places, tags are split like <element
param="blah">
this. This is necessary to keep everyting in one logical XML line so that
docbook renders the table properly.
Replaces #8050.
$ diff -u <(old/systemd-analyze --user unit-paths) <(new/systemd-analyze --user unit-paths)|colordiff
--- /proc/self/fd/14 2018-02-08 14:36:34.190046129 +0100
+++ /proc/self/fd/15 2018-02-08 14:36:34.190046129 +0100
@@ -1,5 +1,5 @@
-/home/zbyszek/.config/systemd/system.control
-/run/user/1000/systemd/system.control
+/home/zbyszek/.config/systemd/user.control
+/run/user/1000/systemd/user.control
/run/user/1000/systemd/transient
...
Strictly speaking, online upgrades of user instances through daemon-reexec will
be broken. We can get away with this since
a) reexecs of the user instance are not commonly done, at least package upgrade
scripts don't do this afawk.
b) cgroups aren't delegateable on cgroupsv1 there's little reason to use "systemctl
set-property" for --user mode
It's not good if the paths are in different order. With --user, we expect
more paths, but it must be a strict superset, and the order for the ones
that appear in both sets must be the same.
$ diff -u <(build/systemd-analyze --global unit-paths) <(build/systemd-analyze --user unit-paths)|colordiff
--- /proc/self/fd/14 2018-02-08 14:11:45.425353107 +0100
+++ /proc/self/fd/15 2018-02-08 14:11:45.426353116 +0100
@@ -1,6 +1,17 @@
+/home/zbyszek/.config/systemd/system.control
+/run/user/1000/systemd/system.control
+/run/user/1000/systemd/transient
+/run/user/1000/systemd/generator.early
+/home/zbyszek/.config/systemd/user
/etc/systemd/user
+/run/user/1000/systemd/user
/run/systemd/user
+/run/user/1000/systemd/generator
+/home/zbyszek/.local/share/systemd/user
+/home/zbyszek/.local/share/flatpak/exports/share/systemd/user
+/var/lib/flatpak/exports/share/systemd/user
/usr/local/share/systemd/user
/usr/share/systemd/user
/usr/local/lib/systemd/user
/usr/lib/systemd/user
+/run/user/1000/systemd/generator.late
A test is added so that we don't regress on this.
This doesn't matter that much, because set-property --global does not work,
so at least those paths wouldn't be used automatically. It is still possible
to create such snippets manually, so we better fix this.
There are cases that we want to trigger and settle only specific
commands. For example, let's say at boot time we want to make sure all
the graphics devices are working correctly because it's critical for
booting, but not the USB subsystem (we'll trigger USB events later). So
we do:
udevadm trigger --action="add" --subsystem-match="graphics"
udevadm settle
However, we cannot block the kernel from emitting kernel events from
discovering USB devices. So if any of the USB kernel event was emitted
before the settle command, the settle command would still wait for the
entire queue to complete. And if the USB event takes a long time to be
processed, the system slows down.
The new `settle` option allows the `trigger` command to wait for only
the triggered events, and effectively solves this problem.