With cgroup v2 the cgroup freezer is implemented as a cgroup
attribute called cgroup.freeze. cgroup can be frozen by writing "1"
to the file and kernel will send us a notification through
"cgroup.events" after the operation is finished and processes in the
cgroup entered quiescent state, i.e. they are not scheduled to
run. Writing "0" to the attribute file does the inverse and process
execution is resumed.
This commit exposes above low-level functionality through systemd's DBus
API. Each unit type must provide specialized implementation for these
methods, otherwise, we return an error. So far only service, scope, and
slice unit types provide the support. It is possible to check if a
given unit has the support using CanFreeze() DBus property.
Note that DBus API has a synchronous behavior and we dispatch the reply
to freeze/thaw requests only after the kernel has notified us that
requested operation was completed.
UnitStatusMessageFormats.finished_job, if present,
will be called with the same arguments as
job_get_done_status_message_format() to provide a format string
appropriate for the context
This commit replaces "Started" with "Finished" for started oneshot
units, as mentioned in the referenced issue
Closes#2458.
Similar, refuse triggering deps on units that cannot trigger.
And rework how we ignore After= dependencies on device units, to work
the same way.
See: #14142
Previously, when first connecting to the bus after connecting to it we'd
issue a ListNames() bus call to the driver to figure out which bus names
are currently active. This information was then used to initialize the
initial state for services that use BusName=.
This change removes the whole code for this and replaces it with
something vastly simpler.
First of all, the ListNames() call was issues synchronosuly, which meant
if dbus was for some reason synchronously calling into PID1 for some
reason we'd deadlock. As it turns out there's now a good chance it does:
the nss-systemd userdb hookup means that any user dbus-daemon resolves
might result in a varlink call into PID 1, and dbus resolves quite a lot
of users while parsing its policy. My original goal was to fix this
deadlock.
But as it turns out we don't need the ListNames() call at all anymore,
since #12957 has been merged. That PR was supposed to fix a race where
asynchronous installation of bus matches would cause us missing the
initial owner of a bus name when a service is first started. It fixed it
(correctly) by enquiring with GetOwnerName() who currently owns the
name, right after installing the match. But this means whenever we start watching a bus name we anyway
issue a GetOwnerName() for it, and that means also when first connecting
to the bus we don't need to issue ListNames() anymore since that just
tells us the same info: which names are currently owned.
hence, let's drop ListNames() and instead make better use of the
GetOwnerName() result: if it failed the name is not owned.
Also, while we are at it, let's simplify the unit's owner_name_changed()
callback(): let's drop the "old_owner" argument. We never used that
besides logging, and it's hard to synthesize from just the return of a
GetOwnerName(), hence don't bother.
I see we log this during every boot, even though it is a routine expected event:
Nov 12 14:50:01 krowka systemd[1]: systemd-journald.service: Service has no hold-off time (RestartSec=0), scheduling restart.
(and for other services too). Let's downgrade this to debug level.
https://bugzilla.redhat.com/show_bug.cgi?id=1614871
Discussed in #13743, the -.service semantic conflicts with the
existing root mount and slice names, making this feature not
uniformly extensible to all types. Change the name to be
<type>.d instead.
Updating to this format also extends the top-level dropin to
unit types.
chase_symlinks() would return negative on error, and either a non-negative status
or a non-negative fd when CHASE_OPEN was given. This made the interface quite
complicated, because dependning on the flags used, we would get two different
"types" of return object. Coverity was always confused by this, and flagged
every use of chase_symlinks() without CHASE_OPEN as a resource leak (because it
would this that an fd is returned). This patch uses a saparate output parameter,
so there is no confusion.
(I think it is OK to have functions which return either an error or an fd. It's
only returning *either* an fd or a non-fd that is confusing.)
There is a slight functional change when load_state == UNIT_MERGED. Before,
we would not call unit_load_dropin(), but now we do. I'm not sure if this
causes an actual difference in behaviour, but since all other unit types do
this, I think it's better to do the same thing here too.
This is the most basic consumer of the new systemd-vs-kernel checker,
both acting as a reasonable standalone exerciser of the code, and also
as a way for easy inspection of deviations from systemd internal state.
v2:
- if RestartKillSignal= is not specified, fall back to KillSignal=. This is necessary
to preserve backwards compatibility (and keep KillSignal= generally useful).
The "Ex" variant was originally only added for ExecStartXYZ= but it makes
sense to have feature parity for the rest of the exec command properties
as well (e.g. ExecReload=, ExecStop=, etc).
In high load scenarios it is possible for services to be started
before the NameOwnerChanged signal is properly installed.
Emulate a callback by also queuing a GetNameOwner when the match is
installed.
Fixes: #12956
The implementation is pretty straight-foward: when we get a request to
clean some type of resources we fork off a process doing that, and while
it is running we are in the "cleaning" state.
Fixes#12258.
This is enough to reproduce:
$ systemd-run bash -c 'sleep 10' && systemctl daemon-reload
would result in
Current command vanished from the unit file.
We would serialize as:
ExecStart 0 /usr/bin/bash /usr/bin/bash -c sleep 10000
which of course can't work.
Now we serialize as
ExecStart 0 /usr/bin/bash "/usr/bin/bash" "-c" "sleep 10".
A service might be able to detect errors by itself that may require the
system to take the same action as if the service locked up. Add a
WATCHDOG=trigger state change notification to sd_notify() to let the
service manager know about the self-detected misery and instantly
trigger the configured watchdog behaviour.
When shooting down a service with SIGABRT the user might want to have a
much longer stop timeout than on regular stops/shutdowns. Especially in
the face of short stop timeouts the time might not be sufficient to
write huge core dumps before the service is killed.
This commit adds a dedicated (Default)TimeoutAbortSec= timer that is
used when stopping a service via SIGABRT. In all other cases the
existing TimeoutStopSec= is used. The timer value is unset by default
to skip the special handling and use TimeoutStopSec= for state
'stop-watchdog' to keep the old behaviour.
If the service is in state 'stop-watchdog' and the service should be
stopped explicitly we still go to 'stop-sigterm' and re-apply the usual
TimeoutStopSec= timeout.
This adds a new per-service OOMPolicy= (along with a global
DefaultOOMPolicy=) that controls what to do if a process of the service
is killed by the kernel's OOM killer. It has three different values:
"continue" (old behaviour), "stop" (terminate the service), "kill" (let
the kernel kill all the service's processes).
On top of that, track OOM killer events per unit: generate a per-unit
structured, recognizable log message when we see an OOM killer event,
and put the service in a failure state if an OOM killer event was seen
and the selected policy was not "continue". A new "result" is defined
for this case: "oom-kill".
All of this relies on new cgroupv2 kernel functionality: the
"memory.events" notification interface and the "memory.oom.group"
attribute (which makes the kernel kill all cgroup processes
automatically).