Introduce support for configuring cpus and mems for processes using
cgroup v2 CPUSET controller. This allows users to limit which cpus
and memory NUMA nodes can be used by processes to better utilize
system resources.
The cgroup v2 interfaces to control it are cpuset.cpus and cpuset.mems
where the requested configuration is written. However, it doesn't mean
that the requested configuration will be actually used as parent cgroup
may limit the cpus or mems as well. In order to reflect the real
configuration cgroup v2 provides read-only files cpuset.cpus.effective
and cpuset.mems.effective which are exported to users as well.
The "Ex" variant was originally only added for ExecStartXYZ= but it makes
sense to have feature parity for the rest of the exec command properties
as well (e.g. ExecReload=, ExecStop=, etc).
v2:
- do not watch mtime of transient and generated dirs
We'd reload the map after every transient unit we created, which we don't
need to do, since we create those units ourselves and know their fragment
path.
This reworks how we load units from disk. Instead of chasing symlinks every
time we are asked to load a unit by name, we slurp all symlinks from disk
and build two hashmaps:
1. from unit name to either alias target, or fragment on disk
(if an alias, we put just the target name in the hashmap, if a fragment
we put an absolute path, so we can distinguish both).
2. from a unit name to all aliases
Reading all this data can be pretty costly (40 ms) on my machine, so we keep it
around for reuse.
The advantage is that we can reliably know what all the aliases of a given unit
are. This means we can reliably load dropins under all names. This fixes#11972.
If for some reason we do not know some signal, instead of silently
skipping it, let's print it numerically. Likewise, 'show' is not the
right place to do value filtering for exit codes. If pid1 accepted it,
let's just print it with no fuss.
We were passing 1/4th of the size in bytes as argument. So depending
on the size of the array, either we'd only transfer a subset of values,
or we'd get an alignment error.
The info printed in this function is the same as the non-Ex version of the
property so there's no point double printing.
Other places that print ExecXYZEx= properties are left alone since the
displayed information is different.
Similar variables had differing names: unit, path, unit_path. We also
have file system paths in surrounding code. Let's make this easier for
the reader and use "dbus_path" consistently.
It had two users, but it is just a very thin wrapper around
unit_file_find_dropin_paths(), so using it seems more complicated than directly
invoking unit_file_find_dropin_paths() twice.
"systemctl --failed" suggested I pass "--all" to see units in the inactive
state as well. I thought this was not very useful. If you explicitly
asked for units in a specific state, then you already know you have
narrowed it down. And if you ran "systemctl --state=inactive", it is even
more strange to see this message.
@keszybz suggests we probably don't want to suggest "list-unit-files"
either :-). Let's only suggest that if the user passed "--state=inactive".
Finally, this means the output for "systemctl --failed" could be just
"0 loaded units listed". In this case, we don't need any highlight on that
text, to distinguish it from the hint. This matches "list-unit-files".
This also means we happen to avoid using red highlight, when there are zero
failed units, as if that itself was a failure. @kesbyz pointed out that
old behaviour was a bit weird.
let's add [static] where it was missing so far
Drop [static] on parameters that can be NULL.
Add an assert() around parameters that have [static] and can't be NULL
hence.
Add some "const" where it was forgotten.
Make possible to set NUMA allocation policy for manager. Manager's
policy is by default inherited to all forked off processes. However, it
is possible to override the policy on per-service basis. Currently we
support, these policies: default, prefer, bind, interleave, local.
See man 2 set_mempolicy for details on each policy.
Overall NUMA policy actually consists of two parts. Policy itself and
bitmask representing NUMA nodes where is policy effective. Node mask can
be specified using related option, NUMAMask. Default mask can be
overwritten on per-service level.
Originally, `systemctl cat` would match only active units, for example:
$ systemctl cat sshd.service
would cat the sshd.service unit file even if the service was inactive.
However:
$ systemctl cat ssh*
would show it only if it was active.
Let's unify the behavior and cat all unit files regardless of a state,
if no state was given explicitly to filter.
(before)
$ build/systemctl list-machines
Need to be root.
$ sudo build/systemctl list-machines
NAME STATE FAILED JOBS
krowka (host) running 0 0
rawhide running 0 0
2 machines listed.
(after)
$ build/systemctl list-machines
NAME STATE FAILED JOBS
krowka (host) running 0 0
rawhide n/a 0 0
2 machines listed.
The output for non-root is missing some bits of information, but we display
what information is missing nicely, and e.g. in the case when no machines are
running at all, or only VMs are running, the unprivileged output would be the
same as the privileged one.
In cgroup v2 we have protection tunables -- currently MemoryLow and
MemoryMin (there will be more in future for other resources, too). The
design of these protection tunables requires not only intermediate
cgroups to propagate protections, but also the units at the leaf of that
resource's operation to accept it (by setting MemoryLow or MemoryMin).
This makes sense from an low-level API design perspective, but it's a
good idea to also have a higher-level abstraction that can, by default,
propagate these resources to children recursively. In this patch, this
happens by having descendants set memory.low to N if their ancestor has
DefaultMemoryLow=N -- assuming they don't set a separate MemoryLow
value.
Any affected unit can opt out of this propagation by manually setting
`MemoryLow` to some value in its unit configuration. A unit can also
stop further propagation by setting `DefaultMemoryLow=` with no
argument. This removes further propagation in the subtree, but has no
effect on the unit itself (for that, use `MemoryLow=0`).
Our use case in production is simplifying the configuration of machines
which heavily rely on memory protection tunables, but currently require
tweaking a huge number of unit files to make that a reality. This
directive makes that significantly less fragile, and decreases the risk
of misconfiguration.
After this patch is merged, I will implement DefaultMemoryMin= using the
same principles.
Some chattrs only work sensible if you set them right after opening a
file for create (think: FS_NOCOW_FL). Others only work when they are
applied when the file is fully written (think: FS_IMMUTABLE_FL). Let's
take that into account when copying files and applying a chattr to them.