Usecase is to allow changing the final kill from SIGKILL to SIGQUIT which
should create a core dump useful for debugging why the service didn't stop
with the SIGTERM
The kernel added support for a new cgroup memory controller knob memory.min in
bf8d5d52ffe8 ("memcg: introduce memory.min") which was merged during v4.18
merge window.
Add MemoryMin to support memory.min.
This new setting is supposed to be useful in most cases where
"MountFlags=slave" is currently used, i.e. as an explicit way to run a
service in its own mount namespace and decouple propagation from all
mounts of the new mount namespace towards the host.
The effect of MountFlags=slave and PrivateMounts=yes is mostly the same,
as both cause a CLONE_NEWNS namespace to be opened, and both will result
in all mounts within it to be mounted MS_SLAVE. The difference is mostly
on the conceptual/philosophical level: configuring the propagation mode
is nothing people should have to think about, in particular as the
matter is not precisely easyto grok. Moreover, MountFlags= allows configuration
of "private" and "slave" modes which don't really make much sense to use
in real-life and are quite confusing. In particular PrivateMounts=private means
mounts made on the host stay pinned for good by the service which is
particularly nasty for removable media mount. And PrivateMounts=shared
is in most ways a NOP when used a alone...
The main technical difference between setting only MountFlags=slave or
only PrivateMounts=yes in a unit file is that the former remounts all
mounts to MS_SLAVE and leaves them there, while that latter remounts
them to MS_SHARED again right after. The latter is generally a nicer
approach, since it disables propagation, while MS_SHARED is afterwards
in effect, which is really nice as that means further namespacing down
the tree will get MS_SHARED logic by default and we unify how
applications see our mounts as we always pass them as MS_SHARED
regardless whether any mount namespacing is used or not.
The effect of PrivateMounts=yes was implied already by all the other
mount namespacing options. With this new option we add an explicit knob
for it, to request it without any other option used as well.
See: #4393
That way we can use it in nspawn.
Also, while we are at it, let's rename the call config_parse_rlimit(),
i.e. insert the "r", to clarify what kind of limit this is about.
This introduces a new setting TemporaryFileSystem=. This is useful
to hide files not relevant to the processes invoked by unit, while
necessary files or directories can be still accessed by combining
with Bind{,ReadOnly}Paths=.
This adds a simple condition/assert/match to the service manager, to
udev's .link handling and to networkd, for matching the kernel version
string.
In this version we only do fnmatch() based globbing, but we might want
to extend that to version comparisons later on, if we like, by slightly
extending the syntax with ">=", "<=", ">", "<" and "==" expressions.
Up until now, the behaviour in systemd has (mostly) been to silently
ignore failures to action unit directives that refer to an unavailble
controller. The addition of AssertControlGroupController and its
conditional counterpart allow explicit specification of the desired
behaviour when such a situation occurs.
As for how this can happen, it is possible that a particular controller
is not available in the cgroup hierarchy. One possible reason for this
is that, in the running kernel, the controller simply doesn't exist --
for example, the CPU controller in cgroup v2 has only recently been
merged and was out of tree until then. Another possibility is that the
controller exists, but has been forcibly disabled by `cgroup_disable=`
on the kernel command line.
In future this will also support whatever comes out of issue #7624,
`DefaultXAccounting=never`, or similar.
Using specifiers in these settings isn't particularly useful by itself,
but it unifies behaviour a bit. It's kinda surprising that What= in
mount units resolves specifies, but Where= does not. Hence let's add
that too. Also, it's surprising Where=/What= in mount units behaves
differently than in automount and swap units, hence resolve specifiers
there too. Then, Type= in mount units is nowadays an arbitrary,
sometimes non-trivial string (think fuse!), hence let's also expand
specifiers there, to match the rest of the mount settings.
This has the benefit that when writing code that generates unit files,
less care has to be taken to check whether escaping of specifiers is
necessary or not: broadly everything that takes arbitrary user strings
now does specifier expansion, while enums/numerics/booleans do not.
All other cases where we accept a reboot argument are decoded with
config_parse_unit_string_printf() rather than
config_parse_unit_path_printf(), and that's really the only thing what
makes sense here, hence adjust this here, too.
SuccessAction= is similar to FailureAction= but declares what to do on
success of a unit, rather than on failure. This is useful for running
commands in qemu/nspawn images, that shall power down on completion. We
frequently see "ExecStopPost=/usr/bin/systemctl poweroff" or so in unit
files like this. Offer a simple, more declarative alternative for this.
While we are at it, hook up failure action with unit_dump() and
transient units too.
Both permit configuring data to pass through STDIN to an invoked
process. StandardInputText= accepts a line of text (possibly with
embedded C-style escapes as well as unit specifiers), which is appended
to the buffer to pass as stdin, followed by a single newline.
StandardInputData= is similar, but accepts arbitrary base64 encoded
data, and will not resolve specifiers or C-style escapes, nor append
newlines.
This may be used to pass input/configuration data to services, directly
in-line from unit files, either in a cooked or in a more raw format.
Right now, the option only takes one of two possible values "inactive"
or "inactive-or-failed", the former being the default, and exposing same
behaviour as the status quo ante. If set to "inactive-or-failed" units
may be collected by the GC logic when in the "failed" state too.
This logic should be a nicer alternative to using the "-" modifier for
ExecStart= and friends, as the exit data is collected and logged about
and only removed when the GC comes along. This should be useful in
particular for per-connection socket-activated services, as well as
"systemd-run" command lines that shall leave no artifacts in the
system.
I was thinking about whether to expose this as a boolean, but opted for
an enum instead, as I have the suspicion other tweaks like this might be
a added later on, in which case we extend this setting instead of having
to add yet another one.
Also, let's add some documentation for the GC logic.
And let's make use of it to implement two new unit settings with it:
1. LogLevelMax= is a new per-unit setting that may be used to configure
log priority filtering: set it to LogLevelMax=notice and only
messages of level "notice" and lower (i.e. more important) will be
processed, all others are dropped.
2. LogExtraFields= is a new per-unit setting for configuring per-unit
journal fields, that are implicitly included in every log record
generated by the unit's processes. It takes field/value pairs in the
form of FOO=BAR.
Also, related to this, one exisiting unit setting is ported to this new
facility:
3. The invocation ID is now pulled from /run/systemd/units/ instead of
cgroupfs xattrs. This substantially relaxes requirements of systemd
on the kernel version and the privileges it runs with (specifically,
cgroupfs xattrs are not available in containers, since they are
stored in kernel memory, and hence are unsafe to permit to lesser
privileged code).
/run/systemd/units/ is a new directory, which contains a number of files
and symlinks encoding the above information. PID 1 creates and manages
these files, and journald reads them from there.
Note that this is supposed to be a direct path between PID 1 and the
journal only, due to the special runtime environment the journal runs
in. Normally, today we shouldn't introduce new interfaces that (mis-)use
a file system as IPC framework, and instead just an IPC system, but this
is very hard to do between the journal and PID 1, as long as the IPC
system is a subject PID 1 manages, and itself a client to the journal.
This patch cleans up a couple of types used in journal code:
specifically we switch to size_t for a couple of memory-sizing values,
as size_t is the right choice for everything that is memory.
Fixes: #4089Fixes: #3041Fixes: #4441
Previously it was not possible to select which controllers to enable for
a unit where Delegate=yes was set, as all controllers were enabled. With
this change, this is made configurable, and thus delegation units can
pick specifically what they want to manage themselves, and what they
don't care about.
PR #3865 introduced RemoveIPC= but the option is not listed in
load-fragment-gperf.gperf. So, the option could be used only via d-bus.
This adds RemoveIPC= in load-fragment-gperf.gperf. Then, now we can
set the option in unit files.
Fixes#7281.
We added JobRunningTimeoutSec= late, and Dracut configured only
JobTimeoutSec= to turn of root device timeouts before. With this change
we'll propagate a reset of JobTimeoutSec= into JobRunningTimeoutSec=,
but only if the latter wasn't set explicitly.
This should restore compatibility with older systemd versions.
Fixes: #6402
Usually, it's a good thing that we isolate the kernel session keyring
for the various services and disconnect them from the user keyring.
However, in case of the cryptsetup key caching we actually want that
multiple instances of the cryptsetup service can share the keys in the
root user's user keyring, hence we need to be able to disable this logic
for them.
This adds KeyringMode=inherit|private|shared:
inherit: don't do any keyring magic (this is the default in systemd --user)
private: a private keyring as before (default in systemd --system)
shared: the new setting
With this setting we can explicitly unset specific variables for
processes of a unit, as last step of assembling the environment block
for them. This is useful to fix#6407.
While we are at it, greatly expand the documentation on how the
environment block for forked off processes is assembled.
Add LockPersonality boolean to allow locking down personality(2)
system call so that the execution domain can't be changed.
This may be useful to improve security because odd emulations
may be poorly tested and source of vulnerabilities, while
system services shouldn't need any weird personalities.
Since busname units are only useful with kdbus, they weren't actively
used. This was dead code, only compile-tested. If busname units are
ever added back, it'll be cleaner to start from scratch (possibly reverting
parts of this patch).
This introduces {State,Cache,Log,Configuration}Directory= those are
similar to RuntimeDirectory=. They create the directories under
/var/lib, /var/cache/, /var/log, or /etc, respectively, with the mode
specified in {State,Cache,Log,Configuration}DirectoryMode=.
This also fixes#6391.
2d79a0bbb9 did that for TimeoutSec=,
89beff89ed did that for JobTimeoutSec=,
and 0004f698df did that for
x-systemd.device-timeout=. But after parsing x-systemd.device-timeout=xxx
we write it out as JobRunningTimeoutSec=xxx. Two options:
- write out JobRunningTimeoutSec=<a very big number>,
- change JobRunningTimeoutSec= to behave like the other options.
I think it would be confusing for JobRunningTimeoutSec= to have different
syntax then TimeoutSec= and JobTimeoutSec=, so this patch implements the
second option.
Fixes#6264, https://bugzilla.redhat.com/show_bug.cgi?id=1462378.
This adds two options that are useful for user units. In particular, it
is useful to check ConditionUser=!0 to not start for the root user.
Closes: #5187
Since commit 36c16a7cdd ("core: rework unit timeout handling, and add
new setting RuntimeMaxSec=") TimeoutSec=0 in mount units has
cause the mount to timeout immediately instead of never as documented.
There is a similar problem with Socket.TimeoutSec and Swap.TimeoutSec.
These are easily fixed using config_parse_sec_fix_0().
Automount.TimeoutIdleSec looks like it could have the same problem,
but doesn't because the kernel treats '0' as 'no timeout'.
It handle USEC_INFINITY correctly only because that constant has
the value '-1', and when round up, it becomes zero.
To avoid possible confusion, use config_parse_sec_fix_0() as well, and
explicitly handle USEC_INFINITY.
Unit.JobTimeoutSec starts when a job is enqueued in a transaction. The
introduced distinct Unit.JobRunningTimeoutSec starts only when the job starts
running (e.g. it groups all Exec* commands of a service or spans waiting for a
device period.)
Unit.JobRunningTimeoutSec is intended to be used by default instead of
Unit.JobTimeoutSec for device units where such behavior causes less confusion
(consider a job for a _netdev mount device, with this change the timeout will
start ticking only after the network is ready).
This is similar to RootDirectory= but mounts the root file system from a
block device or loopback file instead of another directory.
This reuses the image dissector code now used by nspawn and
gpt-auto-discovery.
This adds a boolean unit file setting MountAPIVFS=. If set, the three
main API VFS mounts will be mounted for the service. This only has an
effect on RootDirectory=, which it makes a ton times more useful.
(This is basically the /dev + /proc + /sys mounting code posted in the
original #4727, but rebased on current git, and with the automatic logic
replaced by explicit logic controlled by a unit file setting)
This adds two new settings BindPaths= and BindReadOnlyPaths=. They allow
defining arbitrary bind mounts specific to particular services. This is
particularly useful for services with RootDirectory= set as this permits making
specific bits of the host directory available to chrooted services.
The two new settings follow the concepts nspawn already possess in --bind= and
--bind-ro=, as well as the .nspawn settings Bind= and BindReadOnly= (and these
latter options should probably be renamed to BindPaths= and BindReadOnlyPaths=
too).
Fixes: #3439
Let's permit specifier expansion at a numbre of additional fields, where
arbitrary strings might be passed where this might be useful one day. (Or at
least where there's no clear reason where it wouldn't make sense to have.)
The RestrictNamespaces= takes yes, no or a list of namespaces types,
therefor config_parse_restrict_namespaces() is a bit complex and it
operates on the ExecContext, fix this by passing the offset of
ExecContext directly otherwise restricting namespaces won't work.