The "Ex" variant was originally only added for ExecStartXYZ= but it makes
sense to have feature parity for the rest of the exec command properties
as well (e.g. ExecReload=, ExecStop=, etc).
We were passing 1/4th of the size in bytes as argument. So depending
on the size of the array, either we'd only transfer a subset of values,
or we'd get an alignment error.
I opted to embed the Bitmap structure directly in the ExitStatusSet.
This means that memory usage is a bit higher for units which don't define
this setting:
Service changes:
/* size: 2720, cachelines: 43, members: 73 */
/* sum members: 2680, holes: 9, sum holes: 39 */
/* sum bitfield members: 7 bits, bit holes: 1, sum bit holes: 1 bits */
/* last cacheline: 32 bytes */
/* size: 2816, cachelines: 44, members: 73 */
/* sum members: 2776, holes: 9, sum holes: 39 */
/* sum bitfield members: 7 bits, bit holes: 1, sum bit holes: 1 bits */
But this way the code is simpler and we do less pointer chasing.
When shooting down a service with SIGABRT the user might want to have a
much longer stop timeout than on regular stops/shutdowns. Especially in
the face of short stop timeouts the time might not be sufficient to
write huge core dumps before the service is killed.
This commit adds a dedicated (Default)TimeoutAbortSec= timer that is
used when stopping a service via SIGABRT. In all other cases the
existing TimeoutStopSec= is used. The timer value is unset by default
to skip the special handling and use TimeoutStopSec= for state
'stop-watchdog' to keep the old behaviour.
If the service is in state 'stop-watchdog' and the service should be
stopped explicitly we still go to 'stop-sigterm' and re-apply the usual
TimeoutStopSec= timeout.
This is partially a refactoring, but also makes many more places use
unlocked operations implicitly, i.e. all users of fopen_temporary().
AFAICT, the uses are always for short-lived files which are not shared
externally, and are just used within the same context. Locking is not
necessary.
This adds a new per-service OOMPolicy= (along with a global
DefaultOOMPolicy=) that controls what to do if a process of the service
is killed by the kernel's OOM killer. It has three different values:
"continue" (old behaviour), "stop" (terminate the service), "kill" (let
the kernel kill all the service's processes).
On top of that, track OOM killer events per unit: generate a per-unit
structured, recognizable log message when we see an OOM killer event,
and put the service in a failure state if an OOM killer event was seen
and the selected policy was not "continue". A new "result" is defined
for this case: "oom-kill".
All of this relies on new cgroupv2 kernel functionality: the
"memory.events" notification interface and the "memory.oom.group"
attribute (which makes the kernel kill all cgroup processes
automatically).
Previously we tried to be smart: when a new unit appeared and it only
added controllers to the cgroup mask we'd update the cached members mask
in all parents by ORing in the controller flags in their cached values.
Unfortunately this was quite broken, as we missed some conditions when
this cache had to be reset (for example, when a unit got unloaded),
moreover the optimization doesn't work when a controller is removed
anyway (as in that case there's no other way for the parent to iterate
though all children if any other, remaining child unit still needs it).
Hence, let's simplify the logic substantially: instead of updating the
cache on the right events (which we didn't get right), let's simply
invalidate the cache, and generate it lazily when we encounter it later.
This should actually result in better behaviour as we don't have to
calculate the new members mask for a whole subtree whever we have the
suspicion something changed, but can delay it to the point where we
actually need the members mask.
This allows us to simplify things quite a bit, which is good, since
validating this cache for correctness is hard enough.
Fixes: #9512
The concept is redundant and predates the special chars that do the same
in ExecStar=. Let's settle on advertising just the latter, and hide
PermissionsStartOnly= from the docs (even if we continue supporting it).
In a way this is a follow-up for
a2d1fb882c, but adds a similar warning for
PIDFile=.
There's a much stronger case for doing this kind of notification in
tmpfiles.d (since it helps relating lines to each other for the purpose
of merging them). Doing this for PIDFile= is mostly about being
systematic and copying tmpfiles.d/ behaviour here.
While we are at it, let's also support relative filenames in PIDFile=
now, and prefix them with /run, to make them absolute.
Fixes: #10657
These lines are generally out-of-date, incomplete and unnecessary. With
SPDX and git repository much more accurate and fine grained information
about licensing and authorship is available, hence let's drop the
per-file copyright notice. Of course, removing copyright lines of others
is problematic, hence this commit only removes my own lines and leaves
all others untouched. It might be nicer if sooner or later those could
go away too, making git the only and accurate source of authorship
information.
This part of the copyright blurb stems from the GPL use recommendations:
https://www.gnu.org/licenses/gpl-howto.en.html
The concept appears to originate in times where version control was per
file, instead of per tree, and was a way to glue the files together.
Ultimately, we nowadays don't live in that world anymore, and this
information is entirely useless anyway, as people are very welcome to
copy these files into any projects they like, and they shouldn't have to
change bits that are part of our copyright header for that.
hence, let's just get rid of this old cruft, and shorten our codebase a
bit.
Files which are installed as-is (any .service and other unit files, .conf
files, .policy files, etc), are left as is. My assumption is that SPDX
identifiers are not yet that well known, so it's better to retain the
extended header to avoid any doubt.
I also kept any copyright lines. We can probably remove them, but it'd nice to
obtain explicit acks from all involved authors before doing that.
Let's replace usage of fputc_unlocked() and friends by __fsetlocking(f,
FSETLOCKING_BYCALLER). This turns off locking for the entire FILE*,
instead of doing individual per-call decision whether to use normal
calls or _unlocked() calls.
This has various benefits:
1. It's easier to read and easier not to forget
2. It's more comprehensive, as fprintf() and friends are covered too
(as these functions have no _unlocked() counterpart)
3. Philosophically, it's a bit more correct, because it's more a
property of the file handle really whether we ever pass it on to another
thread, not of the operations we then apply to it.
This patch reworks all pieces of codes that so far used fxyz_unlocked()
calls to use __fsetlocking() instead. It also reworks all places that
use open_memstream(), i.e. use stdio FILE* for string manipulations.
Note that this in some way a revert of 4b61c87511.
Let's make sure that when we return a D-Bus error, we return a native
one, if we generate it ourselves, and use errno-based error
synthetization only if we received an errno ourselves. Yes, this makes
things slightly longer, but is highly misleading as we propagate D-Bus
errors, and not errnos to the client.
This majorly refactors the transient unit file and drop-in writing
logic, so that we properly C-escape and specifier-escape (% → %%)
everything we write out, so that when we read it back again, specifiers
are parsed that aren't supposed to be parsed.
This renames unit_write_drop_in() and friends by unit_write_setting().
The name change is supposed to clarify that the functions are not only
used to write drop-in files, but also transient unit files.
The previous "mode" parameter to this function is replaced by a more
generic "flags", which knows additional flags for implicit C-style and
specifier escaping before writing things out. This can cover most
properties where either form of escaping is defined. For the cases where
this isn't sufficient, we add helpers unit_escape_setting() and
unit_concat_strv() for escaping individual strings or strvs properly.
While we are at it, we also prettify generation of transient unit files:
we try to reduce the number of section headers written out: previously
we'd write the right section header our for each setting. With this
change we do so only if the setting lives in a different section than
the one before.
(This should also be considered preparation for when we add proper APIs
to systemd to write normal, persistant unit files through the bus API)
This adds a per-service restart counter. Each time an automatic
restart is scheduled (due to Restart=) it is increased by one. Its
current value is exposed over the bus as NRestarts=. It is also logged
(in a structured, recognizable way) on each restart.
Note that this really only counts automatic starts triggered by Restart=
(which it nicely complements). Manual restarts will reset the counter,
as will explicit calls to "systemctl reset-failed". It's supposed to be
a tool for measure the automatic restart feature, and nothing else.
Fixes: #4126
As a follow-up for db3f45e2d2 let's do the
same for all other cases where we create a FILE* with local scope and
know that no other threads hence can have access to it.
For most cases this shouldn't change much really, but this should speed
dbus introspection and calender time formatting up a bit.
This adds the boolean RemoveIPC= setting to service, socket, mount and swap
units (i.e. all unit types that may invoke processes). if turned on, and the
unit's user/group is not root, all IPC objects of the user/group are removed
when the service is shut down. The life-cycle of the IPC objects is hence bound
to the unit life-cycle.
This is particularly relevant for units with dynamic users, as it is essential
that no objects owned by the dynamic users survive the service exiting. In
fact, this patch adds code to imply RemoveIPC= if DynamicUser= is set.
In order to communicate the UID/GID of an executed process back to PID 1 this
adds a new "user lookup" socket pair, that is inherited into the forked
processes, and closed before the exec(). This is needed since we cannot do NSS
from PID 1 due to deadlock risks, However need to know the used UID/GID in
order to clean up IPC owned by it if the unit shuts down.
This makes it easier to discern the relevant and obsolete parts of the vtables,
and in particular helps when comparing introspection data with the actual
vtable definitions.
This moves the StartLimitBurst=, StartLimitInterval=, StartLimitAction=, RebootArgument= from the [Service] section
into the [Unit] section of unit files, and thus support it in all unit types, not just in services.
This way we can enforce the start limit much earlier, in particular before testing the unit conditions, so that
repeated start-up failure due to failed conditions is also considered for the start limit logic.
For compatibility the four options may also be configured in the [Service] section still, but we only document them in
their new section [Unit].
This also renamed the socket unit failure code "service-failed-permanent" into "service-start-limit-hit" to express
more clearly what it is about, after all it's only triggered through the start limit being hit.
Finally, the code in busname_trigger_notify() and socket_trigger_notify() is altered to become more alike.
Fixes: #2467
This clean-ups timeout handling in PID 1. Specifically, instead of storing 0 in internal timeout variables as
indication for a disabled timeout, use USEC_INFINITY which is in-line with how we do this in the rest of our code
(following the logic that 0 means "no", and USEC_INFINITY means "never").
This also replace all usec_t additions with invocations to usec_add(), so that USEC_INFINITY is properly propagated,
and sd-event considers it has indication for turning off the event source.
This also alters the deserialization of the units to restart timeouts from the time they were originally started from.
Before this patch timeouts would be restarted beginning with the time of the deserialization, which could lead to
artificially prolonged timeouts if a daemon reload took place.
Finally, a new RuntimeMaxSec= setting is introduced for service units, that specifies a maximum runtime after which a
specific service is forcibly terminated. This is useful to put time limits on time-intensive processing jobs.
This also simplifies the various xyz_spawn() calls of the various types in that explicit distruction of the timers is
removed, as that is done anyway by the state change handlers, and a state change is always done when the xyz_spawn()
calls fail.
Fixes: #2249