v2:
- if RestartKillSignal= is not specified, fall back to KillSignal=. This is necessary
to preserve backwards compatibility (and keep KillSignal= generally useful).
Introduce support for configuring cpus and mems for processes using
cgroup v2 CPUSET controller. This allows users to limit which cpus
and memory NUMA nodes can be used by processes to better utilize
system resources.
The cgroup v2 interfaces to control it are cpuset.cpus and cpuset.mems
where the requested configuration is written. However, it doesn't mean
that the requested configuration will be actually used as parent cgroup
may limit the cpus or mems as well. In order to reflect the real
configuration cgroup v2 provides read-only files cpuset.cpus.effective
and cpuset.mems.effective which are exported to users as well.
"ratelimit" is a real word, so we don't need to use the other form anywhere.
We had both forms in various places, let's standarize on the shorter and more
correct one.
People do have usernames with dots, and it makes them very unhappy that systemd
doesn't like their that. It seems that there is no actual problem with allowing
dots in the username. In particular chown declares ":" as the official
separator, and internally in systemd we never rely on "." as the seperator
between user and group (nor do we call chown directly). Using dots in the name
is probably not a very good idea, but we don't need to care. Debian tools
(adduser) do not allow users with dots to be created.
This patch allows *existing* names with dots to be used in User, Group,
SupplementaryGroups, SocketUser, SocketGroup fields, both in unit files and on
the command line. DynamicUsers and sysusers still follow the strict policy.
user@.service and tmpfiles already allowed arbitrary user names, and this
remains unchanged.
Fixes#12754.
The implementation is pretty straight-foward: when we get a request to
clean some type of resources we fork off a process doing that, and while
it is running we are in the "cleaning" state.
Takes a single /sys/fs/bpf/pinned_prog string as argument, but may be
specified multiple times. An empty assignment resets all previous filters.
Closes https://github.com/systemd/systemd/issues/10227
Make possible to set NUMA allocation policy for manager. Manager's
policy is by default inherited to all forked off processes. However, it
is possible to override the policy on per-service basis. Currently we
support, these policies: default, prefer, bind, interleave, local.
See man 2 set_mempolicy for details on each policy.
Overall NUMA policy actually consists of two parts. Policy itself and
bitmask representing NUMA nodes where is policy effective. Node mask can
be specified using related option, NUMAMask. Default mask can be
overwritten on per-service level.
When shooting down a service with SIGABRT the user might want to have a
much longer stop timeout than on regular stops/shutdowns. Especially in
the face of short stop timeouts the time might not be sufficient to
write huge core dumps before the service is killed.
This commit adds a dedicated (Default)TimeoutAbortSec= timer that is
used when stopping a service via SIGABRT. In all other cases the
existing TimeoutStopSec= is used. The timer value is unset by default
to skip the special handling and use TimeoutStopSec= for state
'stop-watchdog' to keep the old behaviour.
If the service is in state 'stop-watchdog' and the service should be
stopped explicitly we still go to 'stop-sigterm' and re-apply the usual
TimeoutStopSec= timeout.
In cgroup v2 we have protection tunables -- currently MemoryLow and
MemoryMin (there will be more in future for other resources, too). The
design of these protection tunables requires not only intermediate
cgroups to propagate protections, but also the units at the leaf of that
resource's operation to accept it (by setting MemoryLow or MemoryMin).
This makes sense from an low-level API design perspective, but it's a
good idea to also have a higher-level abstraction that can, by default,
propagate these resources to children recursively. In this patch, this
happens by having descendants set memory.low to N if their ancestor has
DefaultMemoryLow=N -- assuming they don't set a separate MemoryLow
value.
Any affected unit can opt out of this propagation by manually setting
`MemoryLow` to some value in its unit configuration. A unit can also
stop further propagation by setting `DefaultMemoryLow=` with no
argument. This removes further propagation in the subtree, but has no
effect on the unit itself (for that, use `MemoryLow=0`).
Our use case in production is simplifying the configuration of machines
which heavily rely on memory protection tunables, but currently require
tweaking a huge number of unit files to make that a reality. This
directive makes that significantly less fragile, and decreases the risk
of misconfiguration.
After this patch is merged, I will implement DefaultMemoryMin= using the
same principles.
This adds a new per-service OOMPolicy= (along with a global
DefaultOOMPolicy=) that controls what to do if a process of the service
is killed by the kernel's OOM killer. It has three different values:
"continue" (old behaviour), "stop" (terminate the service), "kill" (let
the kernel kill all the service's processes).
On top of that, track OOM killer events per unit: generate a per-unit
structured, recognizable log message when we see an OOM killer event,
and put the service in a failure state if an OOM killer event was seen
and the selected policy was not "continue". A new "result" is defined
for this case: "oom-kill".
All of this relies on new cgroupv2 kernel functionality: the
"memory.events" notification interface and the "memory.oom.group"
attribute (which makes the kernel kill all cgroup processes
automatically).
Let services use a private UTS namespace. In addition, a seccomp filter is
installed on set{host,domain}name and a ro bind mounts on
/proc/sys/kernel/{host,domain}name.
This new setting allows configuration of CFS period on the CPU cgroup, instead
of using a hardcoded default of 100ms.
Tested:
- Legacy cgroup + Unified cgroup
- systemctl set-property
- systemctl show
- Confirmed that the cgroup settings (such as cpu.cfs_period_ns) were set
appropriately, including updating the CPU quota (cpu.cfs_quota_ns) when
CPUQuotaPeriodSec= is updated.
- Checked that clamping works properly when either period or (quota * period)
are below the resolution of 1ms, or if period is above the max of 1s.
Some controllers (like the CPU controller) have a performance cost that
is non-trivial on certain workloads. While this can be mitigated and
improved to an extent, there will for some controllers always be some
overheads associated with the benefits gained from the controller.
Inside Facebook, the fix applied has been to disable the CPU controller
forcibly with `cgroup_disable=cpu` on the kernel command line.
This presents a problem: to disable or reenable the controller, a reboot
is required, but this is quite cumbersome and slow to do for many
thousands of machines, especially machines where disabling/enabling a
stateful service on a machine is a matter of several minutes.
Currently systemd provides some configuration knobs for these in the
form of `[Default]CPUAccounting`, `[Default]MemoryAccounting`, and the
like. The limitation of these is that Default*Accounting is overrideable
by individual services, of which any one could decide to reenable a
controller within the hierarchy at any point just by using a controller
feature implicitly (eg. `CPUWeight`), even if the use of that CPU
feature could just be opportunistic. Since many services are provided by
the distribution, or by upstream teams at a particular organisation,
it's not a sustainable solution to simply try to find and remove
offending directives from these units.
This commit presents a more direct solution -- a DisableControllers=
directive that forcibly disallows a controller from being enabled within
a subtree.
This adds SuccessActionExitStatus= and FailureActionExitStatus= that may
be used to configure the exit status to propagate in when
SuccessAction=exit or FailureAction=exit is used.
When not specified let's also propagate the exit status of the main
process we fork off for the unit.
In a way this is a follow-up for
a2d1fb882c, but adds a similar warning for
PIDFile=.
There's a much stronger case for doing this kind of notification in
tmpfiles.d (since it helps relating lines to each other for the purpose
of merging them). Doing this for PIDFile= is mostly about being
systematic and copying tmpfiles.d/ behaviour here.
While we are at it, let's also support relative filenames in PIDFile=
now, and prefix them with /run, to make them absolute.
Fixes: #10657
Add LogRateLimitIntervalSec= and LogRateLimitBurst= options for
services. If provided, these values get passed to the journald
client context, and those values are used in the rate limiting
function in the journal over the the journald.conf values.
Part of #10230
Allows configuring the watchdog signal (with a default of SIGABRT).
This allows an alternative to SIGABRT when coredumps are not desirable.
Appropriate references to SIGABRT or aborting were renamed to reflect
more liberal watchdog signals.
Closes#8658
Usecase is to allow changing the final kill from SIGKILL to SIGQUIT which
should create a core dump useful for debugging why the service didn't stop
with the SIGTERM
The kernel added support for a new cgroup memory controller knob memory.min in
bf8d5d52ffe8 ("memcg: introduce memory.min") which was merged during v4.18
merge window.
Add MemoryMin to support memory.min.
This new setting is supposed to be useful in most cases where
"MountFlags=slave" is currently used, i.e. as an explicit way to run a
service in its own mount namespace and decouple propagation from all
mounts of the new mount namespace towards the host.
The effect of MountFlags=slave and PrivateMounts=yes is mostly the same,
as both cause a CLONE_NEWNS namespace to be opened, and both will result
in all mounts within it to be mounted MS_SLAVE. The difference is mostly
on the conceptual/philosophical level: configuring the propagation mode
is nothing people should have to think about, in particular as the
matter is not precisely easyto grok. Moreover, MountFlags= allows configuration
of "private" and "slave" modes which don't really make much sense to use
in real-life and are quite confusing. In particular PrivateMounts=private means
mounts made on the host stay pinned for good by the service which is
particularly nasty for removable media mount. And PrivateMounts=shared
is in most ways a NOP when used a alone...
The main technical difference between setting only MountFlags=slave or
only PrivateMounts=yes in a unit file is that the former remounts all
mounts to MS_SLAVE and leaves them there, while that latter remounts
them to MS_SHARED again right after. The latter is generally a nicer
approach, since it disables propagation, while MS_SHARED is afterwards
in effect, which is really nice as that means further namespacing down
the tree will get MS_SHARED logic by default and we unify how
applications see our mounts as we always pass them as MS_SHARED
regardless whether any mount namespacing is used or not.
The effect of PrivateMounts=yes was implied already by all the other
mount namespacing options. With this new option we add an explicit knob
for it, to request it without any other option used as well.
See: #4393
That way we can use it in nspawn.
Also, while we are at it, let's rename the call config_parse_rlimit(),
i.e. insert the "r", to clarify what kind of limit this is about.
This introduces a new setting TemporaryFileSystem=. This is useful
to hide files not relevant to the processes invoked by unit, while
necessary files or directories can be still accessed by combining
with Bind{,ReadOnly}Paths=.
This adds a simple condition/assert/match to the service manager, to
udev's .link handling and to networkd, for matching the kernel version
string.
In this version we only do fnmatch() based globbing, but we might want
to extend that to version comparisons later on, if we like, by slightly
extending the syntax with ">=", "<=", ">", "<" and "==" expressions.
Up until now, the behaviour in systemd has (mostly) been to silently
ignore failures to action unit directives that refer to an unavailble
controller. The addition of AssertControlGroupController and its
conditional counterpart allow explicit specification of the desired
behaviour when such a situation occurs.
As for how this can happen, it is possible that a particular controller
is not available in the cgroup hierarchy. One possible reason for this
is that, in the running kernel, the controller simply doesn't exist --
for example, the CPU controller in cgroup v2 has only recently been
merged and was out of tree until then. Another possibility is that the
controller exists, but has been forcibly disabled by `cgroup_disable=`
on the kernel command line.
In future this will also support whatever comes out of issue #7624,
`DefaultXAccounting=never`, or similar.
Using specifiers in these settings isn't particularly useful by itself,
but it unifies behaviour a bit. It's kinda surprising that What= in
mount units resolves specifies, but Where= does not. Hence let's add
that too. Also, it's surprising Where=/What= in mount units behaves
differently than in automount and swap units, hence resolve specifiers
there too. Then, Type= in mount units is nowadays an arbitrary,
sometimes non-trivial string (think fuse!), hence let's also expand
specifiers there, to match the rest of the mount settings.
This has the benefit that when writing code that generates unit files,
less care has to be taken to check whether escaping of specifiers is
necessary or not: broadly everything that takes arbitrary user strings
now does specifier expansion, while enums/numerics/booleans do not.
All other cases where we accept a reboot argument are decoded with
config_parse_unit_string_printf() rather than
config_parse_unit_path_printf(), and that's really the only thing what
makes sense here, hence adjust this here, too.
SuccessAction= is similar to FailureAction= but declares what to do on
success of a unit, rather than on failure. This is useful for running
commands in qemu/nspawn images, that shall power down on completion. We
frequently see "ExecStopPost=/usr/bin/systemctl poweroff" or so in unit
files like this. Offer a simple, more declarative alternative for this.
While we are at it, hook up failure action with unit_dump() and
transient units too.