A service might be able to detect errors by itself that may require the
system to take the same action as if the service locked up. Add a
WATCHDOG=trigger state change notification to sd_notify() to let the
service manager know about the self-detected misery and instantly
trigger the configured watchdog behaviour.
When shooting down a service with SIGABRT the user might want to have a
much longer stop timeout than on regular stops/shutdowns. Especially in
the face of short stop timeouts the time might not be sufficient to
write huge core dumps before the service is killed.
This commit adds a dedicated (Default)TimeoutAbortSec= timer that is
used when stopping a service via SIGABRT. In all other cases the
existing TimeoutStopSec= is used. The timer value is unset by default
to skip the special handling and use TimeoutStopSec= for state
'stop-watchdog' to keep the old behaviour.
If the service is in state 'stop-watchdog' and the service should be
stopped explicitly we still go to 'stop-sigterm' and re-apply the usual
TimeoutStopSec= timeout.
This adds a new per-service OOMPolicy= (along with a global
DefaultOOMPolicy=) that controls what to do if a process of the service
is killed by the kernel's OOM killer. It has three different values:
"continue" (old behaviour), "stop" (terminate the service), "kill" (let
the kernel kill all the service's processes).
On top of that, track OOM killer events per unit: generate a per-unit
structured, recognizable log message when we see an OOM killer event,
and put the service in a failure state if an OOM killer event was seen
and the selected policy was not "continue". A new "result" is defined
for this case: "oom-kill".
All of this relies on new cgroupv2 kernel functionality: the
"memory.events" notification interface and the "memory.oom.group"
attribute (which makes the kernel kill all cgroup processes
automatically).
This makes two changes:
1. Instead of resetting the configured service TTY each time after a
process exited, let's do so only when the service goes back to "dead"
state. This should be preferable in case the started processes leave
background child processes around that still reference the TTY.
2. chmod() and chown() the TTY at the same time. This should make it
safe to run "systemd-run -p DynamicUser=1 -p StandardInput=tty -p
TTYPath=/dev/tty8 /bin/bash" without leaving a TTY owned by a dynamic
user around.
If we know that main pid is our child then it's unnecessary to watch all
other processes of a unit since in this case we will get SIGCHLD when the main
process will exit and will act upon accordingly.
So let's watch all processes only if the main process is not our child since in
this case we need to detect when the cgroup will become empty in order to
figure out when the service becomes dead. This is only needed by cgroupv1.
Some PIDs can remain in the watched list even though their processes have
exited since a long time. It can easily happen if the main process of a forking
service manages to spawn a child before the control process exits for example.
However when a pid is about to be mapped to a unit by calling unit_watch_pid(),
the caller usually knows if the pid should belong to this unit exclusively: if
we just forked() off a child, then we can be sure that its PID is otherwise
unused. In this case we take this opportunity to remove any stalled PIDs from
the watched process list.
If we learnt about a PID in any other form (for example via PID file, via
searching, MAINPID= and so on), then we can't assume anything.
Just some renaming, no change in behaviour.
Background: I'd like to add more functions unit_test_xyz() that test
various things, hence let's streamline the naming a bit.
When there are multiple ExecStop= statements, the next command would continue
to run even after TimeoutStopSec= is up and sends SIGTERM. This is because,
unless Type= is oneshot, the exit code/status would evaluate to SERVICE_SUCCESS
in service_sigchld_event()'s call to is_clean_exit(). This success indicates
following commands would continue running until the end of the list
is reached, or another timeout is hit and SIGKILL is sent.
Since long running processes should not be invoked in non-SERVICE_EXEC_START
commands, consider them for EXIT_CLEAN_COMMAND instead of EXIT_CLEAN_DAEMON.
Passing EXIT_CLEAN_COMMAND to is_clean_exit() evaluates the SIGTERM exit
code/status to failure and will stop execution after the first timeout is hit.
Fixes#11431
KillMode=mixed and control group are used to indicate that all
process should be killed off. SendSIGKILL is used for services
that require a clean shutdown. These are typically database
service where a SigKilled process would result in a lengthy
recovery and who's shutdown or startup time is quite variable
(so Timeout settings aren't of use).
Here we take these two factors and refuse to start a service if
there are existing processes within a control group. Databases,
while generally having some protection against multiple instances
running, lets not stress the rigor of these. Also ExecStartPre
parts of the service aren't as rigoriously written to protect
against against multiple use.
closes#8630
Fixes: #11305Fixes: #3260
Related: #11456
So, here's what happens in the described scenario in #11305. A unit goes
down, and that triggeres stop jobs for the other two units as they were
bound to it. Now, the timer for manager triggered restarts kicks in and
schedules a restart job with the JOB_FAIL job mode. This means there is
a stop job installed on those units, and now due to them being bound to
us they also get a restart job enqueued. This however is a conflicts, as
neither stop can merge into restart, nor restart into stop. However,
restart should be able to replace stop in any case. If the stop
procedure is ongoing, it can cancel the stop job, install itself, and
then after reaching dead finish and convert itself to a start job.
However, if we increase the timer, then it can always take those units
from inactive -> auto-restart.
We change the job mode to JOB_REPLACE so the restart job cancels the
stop job and installs itself.
Also, the original bug could be worked around by bumping RestartSec= to
avoid the conflicting.
This doesn't seem to be something that is going to break uses. That is
because for those who already had it working, there must have never been
conflicting jobs, as that would result in a desctructive transaction by
virtue of the job mode used.
After this change, the test case is able to work nicely without issues.
KeyringMode option is useful for user services. Also, documentation for the
option suggests that the option applies to user services. However, setting the
option to any of its allowed values has no effect.
This commit fixes that and removes EXEC_NEW_KEYRING flag. The flag is no longer
necessary: instead of checking if the flag is set we can check if keyring_mode
is not equal to EXEC_KEYRING_INHERIT.
We previously returned -EPERM but it can be returned for various other reasons
too.
Let's use -ENOLINK instead as this value shouldn't be used currently. This
allows users of CHASE_SAFE to detect without any ambiguities when unsafe
transitions are encountered by chase_symlinks().
All current users of CHASE_SAFE that explicitly reacted on -EPERM have been
converted to react on -ENOLINK.
This allows clients to follow our internal state changes safely.
Previously, quick state changes (for example, when we restart a unit due
to Restart= after it quickly transitioned through DEAD/FAILED states)
would be coalesced into one bus signal event, with this change there's
the guarantee that all state changes after the unit was announced ones
are reflected on th bus.
Note we only do this kind of guaranteed flushing only for unit state
changes, not for other unit property changes, where clients still have
to expect coalescing. This is because the unit state is a very
important, high-level concept.
Fixes: #10185
add new "systemd-run-generator" for running arbitrary commands from the kernel command line as system services using the "systemd.run=" kernel command line switch
This adds SuccessActionExitStatus= and FailureActionExitStatus= that may
be used to configure the exit status to propagate in when
SuccessAction=exit or FailureAction=exit is used.
When not specified let's also propagate the exit status of the main
process we fork off for the unit.
Otherwise we might conflict with the "no-processes-in-inner-cgroup" rule
of cgroupsv2. Consider nspawn starting up and initializing its cgroup
hierarchy with "supervisor/" and "payload/" as subcgroup, with itself
moved into the former and the payload into the latter. Now, if an
ExecStartPre= is run right after it cannot be placed in the main cgroup,
because that is now in inner cgroup with populated children.
Hence, let's run these helpers in another sub-cgroup .control/ below it.
This is somewhat ugly since it weakens the clear separation of
ownership, but given that this is an explicit contract, and double opt-in should be acceptable.
Fixes: #10482
This splits the "environment" field of Manager into two:
transient_environment and client_environment. The former is generated
from configuration file, kernel cmdline, environment generators. The
latter is the one the user can control with "systemctl set-environment"
and similar.
Both sets are merged transparently whenever needed. Separating the two
sets has the benefit that we can safely flush out the former while
keeping the latter during daemon reload cycles, so that env var settings
from env generators or configuration files do not accumulate, but
dynamic API changes are kept around.
Note that this change is not entirely transparent to users: if the user
first uses "set-environment" to override a transient variable, and then
uses "unset-environment" to unset it again things will revert to the
original transient variable now, while previously the variable was fully
removed. This change in behaviour should not matter too much though I
figure.
Fixes: #9972
When we start a service process we pass the selected watchdog timeout to
it with the $WATCHDOG_USEC environment variable. If the unit file is
reconfigured later, we need to make sure to continue to honour the
original timeout, i.e. watch $WATCHDOG_USEC was set to, otherwise we'll
expect the ping at a different time as the service process is sending it
to us.
Hence, whenever we start a unit, save the watchdog timeout, and stick to
that for everything we do.
Fixes: #9467
Let's make sure we always use the right watchdog timeout: when a service
has overwritten it, then stick to it, also for follow-up processes of
the same service.
Let's be more careful with what we serialize: let's ensure we never
serialize strings that are longer than LONG_LINE_MAX, so that we know we
can read them back with read_line(…, LONG_LINE_MAX, …) safely.
In order to implement this all serialization functions are move to
serialize.[ch], and internally will do line size checks. We'd rather
skip a serialization line (with a loud warning) than write an overly
long line out. Of course, this is just a second level protection, after
all the data we serialize shouldn't be this long in the first place.
While we are at it also clean up logging: while serializing make sure to
always log about errors immediately. Also, (void)ify all calls we don't
expect errors in (or catch errors as part of the general
fflush_and_check() at the end.
The service would always be in state == SERVICE_INACTIVE, but it needs to go
through state == SERVICE_START so that SuccessAction/FailureAction are executed.
Allows configuring the watchdog signal (with a default of SIGABRT).
This allows an alternative to SIGABRT when coredumps are not desirable.
Appropriate references to SIGABRT or aborting were renamed to reflect
more liberal watchdog signals.
Closes#8658