Commit Graph

315 Commits

Author SHA1 Message Date
Chris Down bc0623df16 cgroup: analyze: Report memory configurations that deviate from systemd
This is the most basic consumer of the new systemd-vs-kernel checker,
both acting as a reasonable standalone exerciser of the code, and also
as a way for easy inspection of deviations from systemd internal state.
2019-10-03 15:06:25 +01:00
Chris Down 6dfb92823f cgroup: analyze: Match standard dump format
We're the only ones left using = as the delimiter, which looks really
weird in `systemd-analyze dump`. Use `: ` like everyone else.
2019-10-03 15:06:25 +01:00
Chris Down 74b5fb272f cgroup: Allow checking systemd-internal limits against the kernel
We currently don't have any mitigations against another privileged user
on the system messing with the cgroup hierarchy, bringing the system out
of line with what we've set in systemd. We also don't have any real way
to surface this to the user (we do have logs, but you have to know to
look in the first place).

There are a few possible solutions:

1. Maintaining our own cgroup tree with the new fsopen API and having a
   read-only copy for everyone else. However, there are some
   complications on this front, and this may be infeasible in some
   environments. I'd rate this as a longer term effort that's tangential
   to this patch.
2. Actively checking for changes with {fa,i}notify and changing them
   back afterwards to match our configuration again. This is also
   possible, but it's also good to have a way to do passive monitoring
   of the situation without taking hard action. Also, currently daemons
   like senpai do actually need to modify the tree behind systemd's
   back (although hopefully this should be more integrated soon).

This patch implements another option, where one can, on demand, monitor
deviations in cgroup memory configuration from systemd's internal state.
Currently the only consumer is `systemd-analyze dump`, but the interface
is generic enough that it can also be exposed elsewhere later (for
example, over D-Bus).

Currently only memory limit style properties are supported, but later I
also plan to expand this out to other properties that systemd should
have ultimate control over.
2019-10-03 15:06:25 +01:00
Zbigniew Jędrzejewski-Szmek 86e94d95d0
Merge pull request #13246 from keszybz/add-SystemdOptions-efi-variable
Add efi variable to augment /proc/cmdline
2019-10-03 12:19:44 +02:00
Chris Down 64fe532e90 cgroup: Respect DefaultMemoryMin when setting memory.min
This is an oversight from https://github.com/systemd/systemd/pull/12332.

Sadly the tests didn't catch it since it requires a real cgroup
hierarchy to see, and it wasn't seen in prod since we're only currently
using DefaultMemoryLow, not DefaultMemoryMin. :-(
2019-09-30 18:41:21 +01:00
Chris Down 7c9d2b7993 cgroup: Check ancestor memory min for unified memory config
Otherwise we might not enable it when we should, ie. DefaultMemoryMin is
set in a parent, but not MemoryMin in the current unit.
2019-09-30 18:24:26 +01:00
Pavel Hrdina 047f5d63d7 cgroup: introduce support for cgroup v2 CPUSET controller
Introduce support for configuring cpus and mems for processes using
cgroup v2 CPUSET controller.  This allows users to limit which cpus
and memory NUMA nodes can be used by processes to better utilize
system resources.

The cgroup v2 interfaces to control it are cpuset.cpus and cpuset.mems
where the requested configuration is written.  However, it doesn't mean
that the requested configuration will be actually used as parent cgroup
may limit the cpus or mems as well.  In order to reflect the real
configuration cgroup v2 provides read-only files cpuset.cpus.effective
and cpuset.mems.effective which are exported to users as well.
2019-09-24 15:16:07 +02:00
Zbigniew Jędrzejewski-Szmek fdb3decaa7 util-lib: move some functions from basic/cgroup-util to shared/cgroup-setup
This way less stuff needs to be in basic. Initially, I wanted to move all the
parts of cgroup-utils.[ch] that depend on efivars.[ch] to shared, because
efivars.[ch] is in shared/. Later on, I decide to split efivars.[ch], so the
move done in this patch is not necessary anymore. Nevertheless, it is still
valid on its own. If at some point we want to expose libbasic, it is better to
to not have stuff that belong in libshared there.
2019-09-16 18:08:00 +02:00
Zbigniew Jędrzejewski-Szmek d4d99bc6e4 basic/cgroup-util: let cgroup_unified_flush() return the detected hierarchy
This avoid the use of the global variable.

Also rename cgroup_unified_update() to cgroup_unified_cached() and
cgroup_unified_flush() to cgroup_unified() to better reflect their new roles.
2019-09-16 18:06:20 +02:00
Kai Krakow 2dbc45aea7 cgroup: Also set io.bfq.weight
Current kernels with BFQ scheduler do not yet set their IO weight
through "io.weight" but through "io.bfq.weight" (using a slightly
different interface supporting only default weights, not per-device
weights). This commit enables "IOWeight=" to just to that.

This patch may be dropped at some time later.

Github-Link: https://github.com/systemd/systemd/issues/7057
Signed-off-by: Kai Krakow <kai@kaishome.de>
2019-08-20 11:50:59 +02:00
Zbigniew Jędrzejewski-Szmek a505166845
Merge pull request #13096 from keszybz/unit-loading
Preparatory work for the unit loading rework
2019-07-19 21:47:10 +02:00
Zbigniew Jędrzejewski-Szmek f4c43a8115 pid1: do not say "(null)" if no disabled controllers
It looks like we made a mistake. The list is just empty, that's all.
2019-07-19 16:51:14 +02:00
Zbigniew Jędrzejewski-Szmek 217b7b33cc pid1: order jobs that execute processes with lower priority
We can meaningfully compare jobs for units which have cpu weight or nice set.
But non-exec units those have those set.

Starting non-exec jobs first allows us to get them out of the queue quickly,
and consider more jobs for starting.

If we have service A, and socket B, and service C which is after socket B,
and we want to start both A and C, and C has higher cpu weight, if we get
B out of the way first, we'll know that we can start both A and C, and we'll
start C first.

Also invert the comparisons using CMP() so they are always done left vs. right,
and negate when returning instead.

Follow-up for da8e178296.
2019-07-19 14:38:52 +09:00
Michael Olbrich da8e178296 job: make the run queue order deterministic
Jobs are added to the run queue in random order. This happens because most
jobs are added by iterating over the transaction or dependency hash maps.

As a result, jobs that can be executed at the same time are started in a
different order each time.
On small embedded devices this can cause a measurable jitter for the point
in time when a job starts (~100ms jitter for 10 units that are started in
random order).
This results is a similar jitter for the boot time. This is undesirable in
general and make optimizing the boot time a lot harder.
Also, jobs that should have a higher priority because the unit has a higher
CPU weight might get executed later than others.

Fix this by turning the job run_queue into a Prioq and sort by the
following criteria (use the next if the values are equal):
- CPU weight
- nice level
- unit type
- unit name

The last one is just there for deterministic sorting to avoid any jitter.
2019-07-18 10:28:39 +02:00
Zbigniew Jędrzejewski-Szmek 95b21cff0e Apply empty_to_root() in three more spots for safety 2019-07-15 18:39:26 +02:00
Kai Lüke fab347489f bpf-firewall: custom BPF programs through IP(Ingress|Egress)FilterPath=
Takes a single /sys/fs/bpf/pinned_prog string as argument, but may be
specified multiple times. An empty assignment resets all previous filters.

Closes https://github.com/systemd/systemd/issues/10227
2019-06-25 09:56:16 +02:00
Yu Watanabe 270384b2d4 tree-wide: replace strjoina() with prefix_roota() 2019-06-25 01:31:26 +09:00
Lennart Poettering cee97d5768
Merge pull request #12836 from yuwata/tree-wide-replace-strjoin
tree-wide: replace strjoin() with path_join()
2019-06-22 20:02:46 +02:00
Yu Watanabe 657ee2d82b tree-wide: replace strjoin() with path_join() 2019-06-21 03:26:16 +09:00
Donald Buczek 0219b3524f cgroup: Continue unit reset if cgroup is busy
When part of the cgroup hierarchy cannot be deleted (e.g. because there
are still processes in it), do not exit unit_prune_cgroup early, but
continue so that u->cgroup_realized is reset.

Log the known case of non-empty cgroups at debug level and other errors
at warning level.

Fixes https://github.com/systemd/systemd/issues/12386
2019-06-20 10:16:53 +02:00
Chris Down c710d3b430 cgroup: Prevent theoretical nullptr deref in unit mask calculation 2019-06-07 06:33:53 +01:00
Zbigniew Jędrzejewski-Szmek 84d2744bc5 Move warning about unsupported BPF firewall right before the firewall would be created
There's no need to warn about the firewall when parsing, because the unit might
not be started at all. Let's warn only when we're actually preparing to start
the firewall.

This changes behaviour:
- the warning is printed just once for all unit types, and not once
  for normal units and once for transient units.
- on repeat warnings, the message is not printed at all. There's already
  detailed debug info from bpf_firewall_compile(), so we don't need to repeat
  ourselves.
- when we are not root, let's say precisely that, not "lack of necessary privileges"
  and "the local system does not support BPF/cgroup firewalling".

Fixes #12673.
2019-06-04 17:22:37 +02:00
Zbigniew Jędrzejewski-Szmek 2ba6ae6b2b core: do an extra check if oom was triggered when handling sigchild
Should fix #12425.
2019-05-20 16:37:06 +02:00
Ben Boeckel 5238e95759 codespell: fix spelling errors 2019-04-29 16:47:18 +02:00
Lennart Poettering d8974757c4
Merge pull request #12407 from keszybz/two-unrelated-cleanups
Two unrelated cleanups
2019-04-26 23:43:27 +02:00
Zbigniew Jędrzejewski-Szmek c5b7ae0edb
Merge pull request #12074 from poettering/io-acct
expose IO stats on the bus and in "systemctl status" and "systemd-run --wait"
2019-04-25 11:59:37 +02:00
Zbigniew Jędrzejewski-Szmek c5322608a5 core: adjust unit_get_ancestor_memory_{low,min}() to work with units which don't have a CGroupContext
Coverity doesn't like the fact that unit_get_cgroup_context() returns NULL for
unit types that don't have a CGroupContext. We don't expect to call those
functions with such unit types, so this isn't an immediate problem, but we can
make things more robust by handling this case.

CID #1400683, #1400684.
2019-04-25 11:13:02 +02:00
Zbigniew Jędrzejewski-Szmek b6411f716c
Merge pull request #12332 from cdown/default_min
cgroup: Add support for propagation of memory.min
2019-04-25 11:06:45 +02:00
Anita Zhang 25cc30c4c8 core: support DisableControllers= for transient units 2019-04-22 11:52:08 -07:00
Chris Down 7ad5439e06 unit: Add DefaultMemoryMin 2019-04-16 18:45:04 +01:00
Chris Down 6264b85e92 cgroup: Create UNIT_DEFINE_ANCESTOR_MEMORY_LOOKUP
This is in preparation for creating unit_get_ancestor_memory_min.
2019-04-16 18:39:51 +01:00
Chris Down c52db42b78 cgroup: Implement default propagation of MemoryLow with DefaultMemoryLow
In cgroup v2 we have protection tunables -- currently MemoryLow and
MemoryMin (there will be more in future for other resources, too). The
design of these protection tunables requires not only intermediate
cgroups to propagate protections, but also the units at the leaf of that
resource's operation to accept it (by setting MemoryLow or MemoryMin).

This makes sense from an low-level API design perspective, but it's a
good idea to also have a higher-level abstraction that can, by default,
propagate these resources to children recursively. In this patch, this
happens by having descendants set memory.low to N if their ancestor has
DefaultMemoryLow=N -- assuming they don't set a separate MemoryLow
value.

Any affected unit can opt out of this propagation by manually setting
`MemoryLow` to some value in its unit configuration. A unit can also
stop further propagation by setting `DefaultMemoryLow=` with no
argument. This removes further propagation in the subtree, but has no
effect on the unit itself (for that, use `MemoryLow=0`).

Our use case in production is simplifying the configuration of machines
which heavily rely on memory protection tunables, but currently require
tweaking a huge number of unit files to make that a reality. This
directive makes that significantly less fragile, and decreases the risk
of misconfiguration.

After this patch is merged, I will implement DefaultMemoryMin= using the
same principles.
2019-04-12 17:23:58 +02:00
Lennart Poettering fbe14fc9a7 croup: expose IO accounting data per unit
This was the last kind of accounting still not exposed on for each unit.
Let's fix that.

Note that this is a relatively simplistic approach: we don't expose
per-device stats, but sum them all up, much like cgtop does. This kind
of metric is probably the most interesting for most usecases, and covers
the "systemctl status" output best. If we want per-device stats one day
we can of course always add that eventually.
2019-04-12 14:25:44 +02:00
Lennart Poettering 9b2559a13e core: add new call unit_reset_accounting()
It's a simple wrapper for resetting both IP and CPU accounting in one
go.

This will become particularly useful when we also needs this to reset IO
accounting (to be added in a later commit).
2019-04-12 14:25:44 +02:00
Lennart Poettering 0bbff7d638 cgroup: get rid of a local variable 2019-04-12 14:25:44 +02:00
Lennart Poettering afcfaa695c core: implement OOMPolicy= and watch cgroups for OOM killings
This adds a new per-service OOMPolicy= (along with a global
DefaultOOMPolicy=) that controls what to do if a process of the service
is killed by the kernel's OOM killer. It has three different values:
"continue" (old behaviour), "stop" (terminate the service), "kill" (let
the kernel kill all the service's processes).

On top of that, track OOM killer events per unit: generate a per-unit
structured, recognizable log message when we see an OOM killer event,
and put the service in a failure state if an OOM killer event was seen
and the selected policy was not "continue". A new "result" is defined
for this case: "oom-kill".

All of this relies on new cgroupv2 kernel functionality: the
"memory.events" notification interface and the "memory.oom.group"
attribute (which makes the kernel kill all cgroup processes
automatically).
2019-04-09 11:17:58 +02:00
Lennart Poettering 0bb814c2c2 core: rename cgroup_inotify_wd → cgroup_control_inotify_wd
Let's rename the .cgroup_inotify_wd field of the Unit object to
.cgroup_control_inotify_wd. Let's similarly rename the hashmap
.cgroup_inotify_wd_unit of the Manager object to
.cgroup_control_inotify_wd_unit.

Why? As preparation for a later commit that allows us to watch the
"memory.events" cgroup attribute file in addition to the "cgroup.events"
file we already watch with the fields above. In that later commit we'll
add new fields "cgroup_memory_inotify_wd" to Unit and
"cgroup_memory_inotify_wd_unit" to Manager, that are used to watch these
other events file.

No change in behaviour. Just some renaming.
2019-04-09 11:17:57 +02:00
Lennart Poettering 5210387ea6 core: check for redundant operation before doing allocation 2019-04-09 11:17:57 +02:00
Lennart Poettering cbe83389d5 core: rearrange cgroup empty events a bit
So far the priorities for cgroup empty event handling were pretty weird.
The raw events (on cgroupsv2 from inotify, on cgroupsv1 from the agent
dgram socket) where scheduled at a lower priority than the cgroup empty
queue dispatcher. Let's swap that and ensure that we can coalesce events
more agressively: let's process the raw events at higher priority than
the cgroup empty event (which remains at the same prio).
2019-04-09 11:17:57 +02:00
Franck Bui f75f613d25 core: reduce the number of stalled PIDs from the watched processes list when possible
Some PIDs can remain in the watched list even though their processes have
exited since a long time. It can easily happen if the main process of a forking
service manages to spawn a child before the control process exits for example.

However when a pid is about to be mapped to a unit by calling unit_watch_pid(),
the caller usually knows if the pid should belong to this unit exclusively: if
we just forked() off a child, then we can be sure that its PID is otherwise
unused. In this case we take this opportunity to remove any stalled PIDs from
the watched process list.

If we learnt about a PID in any other form (for example via PID file, via
searching, MAINPID= and so on), then we can't assume anything.
2019-03-20 10:51:49 +01:00
Franck Bui 4d05154600 process-util: introduce pid_is_my_child() helper
No functional changes.
2019-03-20 10:51:49 +01:00
Lennart Poettering d8b4d14df4 util: split out nulstr related stuff to nulstr-util.[ch] 2019-03-14 13:25:52 +01:00
Filipe Brandenburger 527ede0c63 core: downgrade CPUQuotaPeriodSec= clamping logs to debug
After the first warning log, further messages are downgraded to LOG_DEBUG.
2019-02-14 11:04:42 -08:00
Filipe Brandenburger 10f2864111 core: add CPUQuotaPeriodSec=
This new setting allows configuration of CFS period on the CPU cgroup, instead
of using a hardcoded default of 100ms.

Tested:
- Legacy cgroup + Unified cgroup
- systemctl set-property
- systemctl show
- Confirmed that the cgroup settings (such as cpu.cfs_period_ns) were set
  appropriately, including updating the CPU quota (cpu.cfs_quota_ns) when
  CPUQuotaPeriodSec= is updated.
- Checked that clamping works properly when either period or (quota * period)
  are below the resolution of 1ms, or if period is above the max of 1s.
2019-02-14 11:04:42 -08:00
Zbigniew Jędrzejewski-Szmek c482724aa5 procfs-util: expose functionality to query total memory
procfs_memory_get_current is renamed to procfs_memory_get_used, because
"current" can mean anything, including total memory, used memory, and free
memory, as long as the value is up to date.

No functional change.
2019-01-22 17:43:13 +01:00
YunQiang Su f5855697aa Pass separate dev_t var to device_path_parse_major_minor
MIPS/O32's st_rdev member of struct stat is unsigned long, which
is 32bit, while dev_t is defined as 64bit, which make some problems
in device_path_parse_major_minor.

Don't pass st.st_rdev, st_mode to device_path_parse_major_minor,
while pass 2 seperate variables. The result of stat is alos copied
out into these 2 variables. Fixes: #11247
2019-01-03 15:04:08 +01:00
Chris Down 4e1dfa45e9 cgroup: s/cgroups? ?v?([0-9])/cgroup v\1/gI
Nitpicky, but we've used a lot of random spacings and names in the past,
but we're trying to be completely consistent on "cgroup vN" now.

Generated by `fd -0 | xargs -0 -n1 sed -ri --follow-symlinks 's/cgroups?  ?v?([0-9])/cgroup v\1/gI'`.

I manually ignored places where it's not appropriate to replace (eg.
"cgroup2" fstype and in src/shared/linux).
2019-01-03 11:32:40 +09:00
Lennart Poettering 2d41e9b7a0
Merge pull request #11143 from keszybz/enable-symlink
Runtime mask symlink confusion fix
2018-12-16 12:37:07 +01:00
Chris Down cb5e3bc37d cgroup: Don't explicitly check for member in UNIT_BEFORE
The parent slice is always filtered ahead of time from UNIT_BEFORE, so
checking if the current member is the same as the parent unit will never
pass.

I may also write a SLICE_FOREACH_CHILD macro to remove some more of the
parent slice checks, but this requires a bit of a rework and general
refactoring and may not be worth it, so let's just do this for now.
2018-12-12 20:50:10 +01:00
Zbigniew Jędrzejewski-Szmek 303ee60151 Mark *data and *userdata params to specifier_printf() as const
It would be very wrong if any of the specfier printf calls modified
any of the objects or data being printed. Let's mark all arguments as const
(primarily to make it easier for the reader to see where modifications cannot
occur).
2018-12-12 16:45:33 +01:00