When shooting down a service with SIGABRT the user might want to have a
much longer stop timeout than on regular stops/shutdowns. Especially in
the face of short stop timeouts the time might not be sufficient to
write huge core dumps before the service is killed.
This commit adds a dedicated (Default)TimeoutAbortSec= timer that is
used when stopping a service via SIGABRT. In all other cases the
existing TimeoutStopSec= is used. The timer value is unset by default
to skip the special handling and use TimeoutStopSec= for state
'stop-watchdog' to keep the old behaviour.
If the service is in state 'stop-watchdog' and the service should be
stopped explicitly we still go to 'stop-sigterm' and re-apply the usual
TimeoutStopSec= timeout.
This adds a new per-service OOMPolicy= (along with a global
DefaultOOMPolicy=) that controls what to do if a process of the service
is killed by the kernel's OOM killer. It has three different values:
"continue" (old behaviour), "stop" (terminate the service), "kill" (let
the kernel kill all the service's processes).
On top of that, track OOM killer events per unit: generate a per-unit
structured, recognizable log message when we see an OOM killer event,
and put the service in a failure state if an OOM killer event was seen
and the selected policy was not "continue". A new "result" is defined
for this case: "oom-kill".
All of this relies on new cgroupv2 kernel functionality: the
"memory.events" notification interface and the "memory.oom.group"
attribute (which makes the kernel kill all cgroup processes
automatically).
When we start a service process we pass the selected watchdog timeout to
it with the $WATCHDOG_USEC environment variable. If the unit file is
reconfigured later, we need to make sure to continue to honour the
original timeout, i.e. watch $WATCHDOG_USEC was set to, otherwise we'll
expect the ping at a different time as the service process is sending it
to us.
Hence, whenever we start a unit, save the watchdog timeout, and stick to
that for everything we do.
Fixes: #9467
Users are often surprised that "systemd-run" command lines like
"systemd-run -p User=idontexist /bin/true" will return successfully,
even though the logs show that the process couldn't be invoked, as the
user "idontexist" doesn't exist. This is because Type=simple will only
wait until fork() succeeded before returning start-up success.
This patch adds a new service type Type=exec, which is very similar to
Type=simple, but waits until the child process completed the execve()
before returning success. It uses a pipe that has O_CLOEXEC set for this
logic, so that the kernel automatically sends POLLHUP on it when the
execve() succeeded but leaves the pipe open if not. This means PID 1
waits exactly until the execve() succeeded in the child, and not longer
and not shorter, which is the desired functionality.
Making use of this new functionality, the command line
"systemd-run -p User=idontexist -p Type=exec /bin/true" will now fail,
as expected.
These lines are generally out-of-date, incomplete and unnecessary. With
SPDX and git repository much more accurate and fine grained information
about licensing and authorship is available, hence let's drop the
per-file copyright notice. Of course, removing copyright lines of others
is problematic, hence this commit only removes my own lines and leaves
all others untouched. It might be nicer if sooner or later those could
go away too, making git the only and accurate source of authorship
information.
This part of the copyright blurb stems from the GPL use recommendations:
https://www.gnu.org/licenses/gpl-howto.en.html
The concept appears to originate in times where version control was per
file, instead of per tree, and was a way to glue the files together.
Ultimately, we nowadays don't live in that world anymore, and this
information is entirely useless anyway, as people are very welcome to
copy these files into any projects they like, and they shouldn't have to
change bits that are part of our copyright header for that.
hence, let's just get rid of this old cruft, and shorten our codebase a
bit.
Previously we were a bit sloppy with the index and size types of arrays,
we'd regularly use unsigned. While I don't think this ever resulted in
real issues I think we should be more careful there and follow a
stricter regime: unless there's a strong reason not to use size_t for
array sizes and indexes, size_t it should be. Any allocations we do
ultimately will use size_t anyway, and converting forth and back between
unsigned and size_t will always be a source of problems.
Note that on 32bit machines "unsigned" and "size_t" are equivalent, and
on 64bit machines our arrays shouldn't grow that large anyway, and if
they do we have a problem, however that kind of overly large allocation
we have protections for usually, but for overflows we do not have that
so much, hence let's add it.
So yeah, it's a story of the current code being already "good enough",
but I think some extra type hygiene is better.
This patch tries to be comprehensive, but it probably isn't and I missed
a few cases. But I guess we can cover that later as we notice it. Among
smaller fixes, this changes:
1. strv_length()' return type becomes size_t
2. the unit file changes array size becomes size_t
3. DNS answer and query array sizes become size_t
Fixes: https://bugs.freedesktop.org/show_bug.cgi?id=76745
Files which are installed as-is (any .service and other unit files, .conf
files, .policy files, etc), are left as is. My assumption is that SPDX
identifiers are not yet that well known, so it's better to retain the
extended header to avoid any doubt.
I also kept any copyright lines. We can probably remove them, but it'd nice to
obtain explicit acks from all involved authors before doing that.
An auto-restarted unit B may depend on unit A with StopWhenUnneeded=yes.
If A stops before B's restart timeout expires, it'll be started again as part
of B's dependent jobs. However, if stopping takes longer than the timeout, B's
running stop job collides start job which also cancels B's start job. Result is
that neither A or B are active.
Currently, when a service with automatic restarting fails, it transitions
through following states:
1) SERVICE_FAILED or SERVICE_DEAD to indicate the failure,
2) SERVICE_AUTO_RESTART while restart timer is running.
The StopWhenUnneeded= check takes place in service_enter_dead between the two
state mentioned above. We temporarily store the auto restart flag to query it
during the check. Because we don't return control to the main event loop, this
new service unit flag needn't be serialized.
This patch prevents the pathologic situation when the service with Restart=
won't restart automatically. As a side effect it also avoid restarting the
dependency unit with StopWhenUnneeded=yes.
Fixes: #7377
This introduces a new function unit_prepare_exec() that encapsulates a
number of calls we do in preparation for spawning off some processes in
all our unit types that do so.
This allows us to neatly unify a bit of code between unit types and
shorten our code.
When preparing for a restart we quickly go through the DEAD/INACTIVE
service state before entering AUTO_RESTART. When doing this, we need to
make sure we don't destroy the FD store. Previously this was done by
checking the failure state of the unit, and keeping the FD store around
when the unit failed, under the assumption that the restart logic will
then get into action.
This is not entirely correct howver, as there might be failure states
that will no result in restarts.
With this commit we slightly alter the logic: a ref counter for the fd
store is added, that is increased right before we handle the restart
logic, and decreased again right-after.
This should ensure that the fdstore lives exactly as long as it needs.
Follow-up for f0bfbfac43.
This adds a per-service restart counter. Each time an automatic
restart is scheduled (due to Restart=) it is increased by one. Its
current value is exposed over the bus as NRestarts=. It is also logged
(in a structured, recognizable way) on each restart.
Note that this really only counts automatic starts triggered by Restart=
(which it nicely complements). Manual restarts will reset the counter,
as will explicit calls to "systemctl reset-failed". It's supposed to be
a tool for measure the automatic restart feature, and nothing else.
Fixes: #4126
Introduce a SERVICE_FAILURE_PROTOCOL error type for when a service does
not follow the protocol.
This error type is used when a pid file is expected, but not delivered.
This adds a new boolean setting DynamicUser= to service files. If set, a new
user will be allocated dynamically when the unit is started, and released when
it is stopped. The user ID is allocated from the range 61184..65519. The user
will not be added to /etc/passwd (but an NSS module to be added later should
make it show up in getent passwd).
For now, care should be taken that the service writes no files to disk, since
this might result in files owned by UIDs that might get assigned dynamically to
a different service later on. Later patches will tighten sandboxing in order to
ensure that this cannot happen, except for a few selected directories.
A simple way to test this is:
systemd-run -p DynamicUser=1 /bin/sleep 99999
Add sd_notify() parameter to change watchdog_usec during runtime.
Application can change watchdog_usec value by
sd_notify like this. Example. sd_notify(0, "WATCHDOG_USEC=20000000").
To reset watchdog_usec as configured value in service file,
restart service.
Notice.
sd_event is not currently supported. If application uses
sd_event_set_watchdog, or sd_watchdog_enabled, do not use
"WATCHDOG_USEC" option through sd_notify.
Let's move the enforcement of the per-unit start limit from unit.c into the
type-specific files again. For unit types that know a concept of "result" codes
this allows us to hook up the start limit condition to it with an explicit
result code. Also, this makes sure that the state checks in clal like
service_start() may be done before the start limit is checked, as the start
limit really should be checked last, right before everything has been verified
to be in order.
The generic start limit logic is left in unit.c, but the invocation of it is
moved into the per-type files, in the various xyz_start() functions, so that
they may place the check at the right location.
Note that this change drops the enforcement entirely from device, slice, target
and scope units, since these unit types generally may not fail activation, or
may only be activated a single time. This is also documented now.
Note that restores the "start-limit-hit" result code that existed before
6bf0f408e4 already in the service code. However,
it's not introduced for all units that have a result code concept.
Fixes#3166.
The "resources" error is really just the generic error we return when
we hit some kind of error and we have no more appropriate error for the case to
return, for example because of some OS error.
Hence, reword the explanation and don't claim any relation to resource limits.
Admittedly, the "resources" service error is a bit of a misnomer, but I figure
it's kind of API now.
Fixes: #2716
This moves the StartLimitBurst=, StartLimitInterval=, StartLimitAction=, RebootArgument= from the [Service] section
into the [Unit] section of unit files, and thus support it in all unit types, not just in services.
This way we can enforce the start limit much earlier, in particular before testing the unit conditions, so that
repeated start-up failure due to failed conditions is also considered for the start limit logic.
For compatibility the four options may also be configured in the [Service] section still, but we only document them in
their new section [Unit].
This also renamed the socket unit failure code "service-failed-permanent" into "service-start-limit-hit" to express
more clearly what it is about, after all it's only triggered through the start limit being hit.
Finally, the code in busname_trigger_notify() and socket_trigger_notify() is altered to become more alike.
Fixes: #2467
This clean-ups timeout handling in PID 1. Specifically, instead of storing 0 in internal timeout variables as
indication for a disabled timeout, use USEC_INFINITY which is in-line with how we do this in the rest of our code
(following the logic that 0 means "no", and USEC_INFINITY means "never").
This also replace all usec_t additions with invocations to usec_add(), so that USEC_INFINITY is properly propagated,
and sd-event considers it has indication for turning off the event source.
This also alters the deserialization of the units to restart timeouts from the time they were originally started from.
Before this patch timeouts would be restarted beginning with the time of the deserialization, which could lead to
artificially prolonged timeouts if a daemon reload took place.
Finally, a new RuntimeMaxSec= setting is introduced for service units, that specifies a maximum runtime after which a
specific service is forcibly terminated. This is useful to put time limits on time-intensive processing jobs.
This also simplifies the various xyz_spawn() calls of the various types in that explicit distruction of the timers is
removed, as that is done anyway by the state change handlers, and a state change is always done when the xyz_spawn()
calls fail.
Fixes: #2249
During daemon-reload, PID1 temporarly loses its DBus connection, so there's
a small window in which all signals sent by dbus-daemon are lost.
This is a problem, since we rely on the NameOwnerChanged signals in order to
consider a service with Type=dbus fully started or terminated, respectively.
In order to fix this, a rewrite of bus_list_names() is necessary. We used
to walk the current list of names on the bus, and blindly triggered the
bus_name_owner_change() callback on each service, providing the actual name
as current owner. This implementation has a number of problems:
* We cannot detect if the the name was moved from one owner to the other
while we were reloading
* We don't notify services which missed the name loss signal
* Providing the actual name as current owner is a hack, as the comment also
admits.
To fix this, this patch carries the following changes:
* Track the name of the current bus name owner, and (de-)serialize it
during reload. This way, we can detect changes.
* In bus_list_names(), walk the list of bus names we're interested in
first, and then see if the name is active on the bus. If it is,
check it it's still the same as it used to be, and synthesize
NameOwnerChanged signals for the name add and/or loss.
This should fully synchronize the current name list with the internal
state of all services.
When starting a transient service, allow setting stdin/stdout/stderr fds
for it, by passing them in via the bus.
This also simplifies some of the serialization code for units.
This adds support for naming file descriptors passed using socket
activation. The names are passed in a new $LISTEN_FDNAMES= environment
variable, that matches the existign $LISTEN_FDS= one and contains a
colon-separated list of names.
This also adds support for naming fds submitted to the per-service fd
store using FDNAME= in the sd_notify() message.
This also adds a new FileDescriptorName= setting for socket unit files
to set the name for fds created by socket units.
This also adds a new call sd_listen_fds_with_names(), that is similar to
sd_listen_fds(), but also returns the names of the fds.
systemd-activate gained the new --fdname= switch to specify a name for
testing socket activation.
This is based on #1247 by Maciej Wereski.
Fixes#1247.
This patch removes includes that are not used. The removals were found with
include-what-you-use which checks if any of the symbols from a header is
in use.
With this change it is possible to send file descriptors to PID 1, via
sd_pid_notify_with_fds() which PID 1 will store individually for each
service, and pass via the usual fd passing logic on next invocation.
This is useful for enable daemon reload schemes where daemons serialize
their state to /run, push their fds into PID 1 and terminate, restoring
their state on next start from the data in /run and passed in from PID
1.
The fds are kept by PID 1 as long as no POLLHUP or POLLERR is seen on
them, and the service they belong to are either not dead or failed, or
have a job queued.
This makes possible to spawn service instances triggered by socket with
MLS/MCS SELinux labels which are created based on information provided by
connected peer.
Implementation of label_get_child_mls_label derived from xinetd.
Reviewed-by: Paul Moore <pmoore@redhat.com>
If BusPolicy= was passed, the parser function will have created
an ExecContext->bus_endpoint object, along with policy information.
In that case, create a kdbus endpoint, and pass its path name to the
namespace logic, to it will be mounted over the actual 'bus' node.
At endpoint creation time, no policy is updloaded. That is done after
fork(), through a separate call. This is necessary because we don't
know the real uid of the process earlier than that.
When this system-wide start-up timeout is hit we execute one of the
failure actions already implemented for services that fail.
This should not only be useful on embedded devices, but also on laptops
which have the power-button reachable when the lid is closed. This
devices, when in a backpack might get powered on by accident due to the
easily reachable power button. We want to make sure that the system
turns itself off if it starts up due this after a while.
When the system manages to fully start-up logind will suspend the
machine by default if the lid is closed. However, in some cases we don't
even get as far as logind, and the boot hangs much earlier, for example
because we ask for a LUKS password that nobody ever enters.
Yeah, this is a real-life problem on my Yoga 13, which has one of those
easily accessible power buttons, even if the device is closed.