Commit graph

296 commits

Author SHA1 Message Date
Yu Watanabe 657ee2d82b tree-wide: replace strjoin() with path_join() 2019-06-21 03:26:16 +09:00
Chris Down c710d3b430 cgroup: Prevent theoretical nullptr deref in unit mask calculation 2019-06-07 06:33:53 +01:00
Zbigniew Jędrzejewski-Szmek 84d2744bc5 Move warning about unsupported BPF firewall right before the firewall would be created
There's no need to warn about the firewall when parsing, because the unit might
not be started at all. Let's warn only when we're actually preparing to start
the firewall.

This changes behaviour:
- the warning is printed just once for all unit types, and not once
  for normal units and once for transient units.
- on repeat warnings, the message is not printed at all. There's already
  detailed debug info from bpf_firewall_compile(), so we don't need to repeat
  ourselves.
- when we are not root, let's say precisely that, not "lack of necessary privileges"
  and "the local system does not support BPF/cgroup firewalling".

Fixes #12673.
2019-06-04 17:22:37 +02:00
Zbigniew Jędrzejewski-Szmek 2ba6ae6b2b core: do an extra check if oom was triggered when handling sigchild
Should fix #12425.
2019-05-20 16:37:06 +02:00
Ben Boeckel 5238e95759 codespell: fix spelling errors 2019-04-29 16:47:18 +02:00
Lennart Poettering d8974757c4
Merge pull request #12407 from keszybz/two-unrelated-cleanups
Two unrelated cleanups
2019-04-26 23:43:27 +02:00
Zbigniew Jędrzejewski-Szmek c5b7ae0edb
Merge pull request #12074 from poettering/io-acct
expose IO stats on the bus and in "systemctl status" and "systemd-run --wait"
2019-04-25 11:59:37 +02:00
Zbigniew Jędrzejewski-Szmek c5322608a5 core: adjust unit_get_ancestor_memory_{low,min}() to work with units which don't have a CGroupContext
Coverity doesn't like the fact that unit_get_cgroup_context() returns NULL for
unit types that don't have a CGroupContext. We don't expect to call those
functions with such unit types, so this isn't an immediate problem, but we can
make things more robust by handling this case.

CID #1400683, #1400684.
2019-04-25 11:13:02 +02:00
Zbigniew Jędrzejewski-Szmek b6411f716c
Merge pull request #12332 from cdown/default_min
cgroup: Add support for propagation of memory.min
2019-04-25 11:06:45 +02:00
Anita Zhang 25cc30c4c8 core: support DisableControllers= for transient units 2019-04-22 11:52:08 -07:00
Chris Down 7ad5439e06 unit: Add DefaultMemoryMin 2019-04-16 18:45:04 +01:00
Chris Down 6264b85e92 cgroup: Create UNIT_DEFINE_ANCESTOR_MEMORY_LOOKUP
This is in preparation for creating unit_get_ancestor_memory_min.
2019-04-16 18:39:51 +01:00
Chris Down c52db42b78 cgroup: Implement default propagation of MemoryLow with DefaultMemoryLow
In cgroup v2 we have protection tunables -- currently MemoryLow and
MemoryMin (there will be more in future for other resources, too). The
design of these protection tunables requires not only intermediate
cgroups to propagate protections, but also the units at the leaf of that
resource's operation to accept it (by setting MemoryLow or MemoryMin).

This makes sense from an low-level API design perspective, but it's a
good idea to also have a higher-level abstraction that can, by default,
propagate these resources to children recursively. In this patch, this
happens by having descendants set memory.low to N if their ancestor has
DefaultMemoryLow=N -- assuming they don't set a separate MemoryLow
value.

Any affected unit can opt out of this propagation by manually setting
`MemoryLow` to some value in its unit configuration. A unit can also
stop further propagation by setting `DefaultMemoryLow=` with no
argument. This removes further propagation in the subtree, but has no
effect on the unit itself (for that, use `MemoryLow=0`).

Our use case in production is simplifying the configuration of machines
which heavily rely on memory protection tunables, but currently require
tweaking a huge number of unit files to make that a reality. This
directive makes that significantly less fragile, and decreases the risk
of misconfiguration.

After this patch is merged, I will implement DefaultMemoryMin= using the
same principles.
2019-04-12 17:23:58 +02:00
Lennart Poettering fbe14fc9a7 croup: expose IO accounting data per unit
This was the last kind of accounting still not exposed on for each unit.
Let's fix that.

Note that this is a relatively simplistic approach: we don't expose
per-device stats, but sum them all up, much like cgtop does. This kind
of metric is probably the most interesting for most usecases, and covers
the "systemctl status" output best. If we want per-device stats one day
we can of course always add that eventually.
2019-04-12 14:25:44 +02:00
Lennart Poettering 9b2559a13e core: add new call unit_reset_accounting()
It's a simple wrapper for resetting both IP and CPU accounting in one
go.

This will become particularly useful when we also needs this to reset IO
accounting (to be added in a later commit).
2019-04-12 14:25:44 +02:00
Lennart Poettering 0bbff7d638 cgroup: get rid of a local variable 2019-04-12 14:25:44 +02:00
Lennart Poettering afcfaa695c core: implement OOMPolicy= and watch cgroups for OOM killings
This adds a new per-service OOMPolicy= (along with a global
DefaultOOMPolicy=) that controls what to do if a process of the service
is killed by the kernel's OOM killer. It has three different values:
"continue" (old behaviour), "stop" (terminate the service), "kill" (let
the kernel kill all the service's processes).

On top of that, track OOM killer events per unit: generate a per-unit
structured, recognizable log message when we see an OOM killer event,
and put the service in a failure state if an OOM killer event was seen
and the selected policy was not "continue". A new "result" is defined
for this case: "oom-kill".

All of this relies on new cgroupv2 kernel functionality: the
"memory.events" notification interface and the "memory.oom.group"
attribute (which makes the kernel kill all cgroup processes
automatically).
2019-04-09 11:17:58 +02:00
Lennart Poettering 0bb814c2c2 core: rename cgroup_inotify_wd → cgroup_control_inotify_wd
Let's rename the .cgroup_inotify_wd field of the Unit object to
.cgroup_control_inotify_wd. Let's similarly rename the hashmap
.cgroup_inotify_wd_unit of the Manager object to
.cgroup_control_inotify_wd_unit.

Why? As preparation for a later commit that allows us to watch the
"memory.events" cgroup attribute file in addition to the "cgroup.events"
file we already watch with the fields above. In that later commit we'll
add new fields "cgroup_memory_inotify_wd" to Unit and
"cgroup_memory_inotify_wd_unit" to Manager, that are used to watch these
other events file.

No change in behaviour. Just some renaming.
2019-04-09 11:17:57 +02:00
Lennart Poettering 5210387ea6 core: check for redundant operation before doing allocation 2019-04-09 11:17:57 +02:00
Lennart Poettering cbe83389d5 core: rearrange cgroup empty events a bit
So far the priorities for cgroup empty event handling were pretty weird.
The raw events (on cgroupsv2 from inotify, on cgroupsv1 from the agent
dgram socket) where scheduled at a lower priority than the cgroup empty
queue dispatcher. Let's swap that and ensure that we can coalesce events
more agressively: let's process the raw events at higher priority than
the cgroup empty event (which remains at the same prio).
2019-04-09 11:17:57 +02:00
Franck Bui f75f613d25 core: reduce the number of stalled PIDs from the watched processes list when possible
Some PIDs can remain in the watched list even though their processes have
exited since a long time. It can easily happen if the main process of a forking
service manages to spawn a child before the control process exits for example.

However when a pid is about to be mapped to a unit by calling unit_watch_pid(),
the caller usually knows if the pid should belong to this unit exclusively: if
we just forked() off a child, then we can be sure that its PID is otherwise
unused. In this case we take this opportunity to remove any stalled PIDs from
the watched process list.

If we learnt about a PID in any other form (for example via PID file, via
searching, MAINPID= and so on), then we can't assume anything.
2019-03-20 10:51:49 +01:00
Franck Bui 4d05154600 process-util: introduce pid_is_my_child() helper
No functional changes.
2019-03-20 10:51:49 +01:00
Lennart Poettering d8b4d14df4 util: split out nulstr related stuff to nulstr-util.[ch] 2019-03-14 13:25:52 +01:00
Filipe Brandenburger 527ede0c63 core: downgrade CPUQuotaPeriodSec= clamping logs to debug
After the first warning log, further messages are downgraded to LOG_DEBUG.
2019-02-14 11:04:42 -08:00
Filipe Brandenburger 10f2864111 core: add CPUQuotaPeriodSec=
This new setting allows configuration of CFS period on the CPU cgroup, instead
of using a hardcoded default of 100ms.

Tested:
- Legacy cgroup + Unified cgroup
- systemctl set-property
- systemctl show
- Confirmed that the cgroup settings (such as cpu.cfs_period_ns) were set
  appropriately, including updating the CPU quota (cpu.cfs_quota_ns) when
  CPUQuotaPeriodSec= is updated.
- Checked that clamping works properly when either period or (quota * period)
  are below the resolution of 1ms, or if period is above the max of 1s.
2019-02-14 11:04:42 -08:00
Zbigniew Jędrzejewski-Szmek c482724aa5 procfs-util: expose functionality to query total memory
procfs_memory_get_current is renamed to procfs_memory_get_used, because
"current" can mean anything, including total memory, used memory, and free
memory, as long as the value is up to date.

No functional change.
2019-01-22 17:43:13 +01:00
YunQiang Su f5855697aa Pass separate dev_t var to device_path_parse_major_minor
MIPS/O32's st_rdev member of struct stat is unsigned long, which
is 32bit, while dev_t is defined as 64bit, which make some problems
in device_path_parse_major_minor.

Don't pass st.st_rdev, st_mode to device_path_parse_major_minor,
while pass 2 seperate variables. The result of stat is alos copied
out into these 2 variables. Fixes: #11247
2019-01-03 15:04:08 +01:00
Chris Down 4e1dfa45e9 cgroup: s/cgroups? ?v?([0-9])/cgroup v\1/gI
Nitpicky, but we've used a lot of random spacings and names in the past,
but we're trying to be completely consistent on "cgroup vN" now.

Generated by `fd -0 | xargs -0 -n1 sed -ri --follow-symlinks 's/cgroups?  ?v?([0-9])/cgroup v\1/gI'`.

I manually ignored places where it's not appropriate to replace (eg.
"cgroup2" fstype and in src/shared/linux).
2019-01-03 11:32:40 +09:00
Lennart Poettering 2d41e9b7a0
Merge pull request #11143 from keszybz/enable-symlink
Runtime mask symlink confusion fix
2018-12-16 12:37:07 +01:00
Chris Down cb5e3bc37d cgroup: Don't explicitly check for member in UNIT_BEFORE
The parent slice is always filtered ahead of time from UNIT_BEFORE, so
checking if the current member is the same as the parent unit will never
pass.

I may also write a SLICE_FOREACH_CHILD macro to remove some more of the
parent slice checks, but this requires a bit of a rework and general
refactoring and may not be worth it, so let's just do this for now.
2018-12-12 20:50:10 +01:00
Zbigniew Jędrzejewski-Szmek 303ee60151 Mark *data and *userdata params to specifier_printf() as const
It would be very wrong if any of the specfier printf calls modified
any of the objects or data being printed. Let's mark all arguments as const
(primarily to make it easier for the reader to see where modifications cannot
occur).
2018-12-12 16:45:33 +01:00
Lennart Poettering d742f4b54b cgroup: correct mangling of return values
Let's nor return the unmangled return value before we actually mangle
it.

Fixes: #11062
2018-12-10 16:09:41 +01:00
Lennart Poettering 92a993041a cgroup: call cg_all_unified() right before using the result
Let's not query it before we actually need it.
2018-12-10 16:09:41 +01:00
Lennart Poettering ea900d2bfe
Merge pull request #11009 from poettering/root-cgroup-again
tweak root cgroup attribute fiddling for cgroupsv1 again
2018-12-04 12:33:03 +01:00
Chris Down c72703e26d cgroup: Add DisableControllers= directive to disable controller in subtree
Some controllers (like the CPU controller) have a performance cost that
is non-trivial on certain workloads. While this can be mitigated and
improved to an extent, there will for some controllers always be some
overheads associated with the benefits gained from the controller.
Inside Facebook, the fix applied has been to disable the CPU controller
forcibly with `cgroup_disable=cpu` on the kernel command line.

This presents a problem: to disable or reenable the controller, a reboot
is required, but this is quite cumbersome and slow to do for many
thousands of machines, especially machines where disabling/enabling a
stateful service on a machine is a matter of several minutes.

Currently systemd provides some configuration knobs for these in the
form of `[Default]CPUAccounting`, `[Default]MemoryAccounting`, and the
like. The limitation of these is that Default*Accounting is overrideable
by individual services, of which any one could decide to reenable a
controller within the hierarchy at any point just by using a controller
feature implicitly (eg. `CPUWeight`), even if the use of that CPU
feature could just be opportunistic. Since many services are provided by
the distribution, or by upstream teams at a particular organisation,
it's not a sustainable solution to simply try to find and remove
offending directives from these units.

This commit presents a more direct solution -- a DisableControllers=
directive that forcibly disallows a controller from being enabled within
a subtree.
2018-12-03 15:40:31 +00:00
Chris Down 4f6f62e468 cgroup: Traverse leaves to realised cgroup to release controllers
This adds a depth-first version of unit_realize_cgroup_now which can
only do depth-first disabling of controllers, in preparation for the
DisableController= directive.
2018-12-03 14:37:39 +00:00
Chris Down a57669d290 cgroup: Rework unit_realize_cgroup_now to explicitly be breadth-first
systemd currently doesn't really expend much effort in disabling
controllers. unit_realize_cgroup_now *may* be able to disable a
controller in the basic case when using cgroup v2, but generally won't
manage as downstream dependents may still use it.

This code doesn't add any logic to fix that, but it starts the process
of moving to have a breadth-first version of unit_realize_cgroup_now for
enabling, and a depth-first version of unit_realize_cgroup_now for
disabling.
2018-12-03 14:37:39 +00:00
Chris Down 0d2d6fbf15 cgroup: Move attribute application into unit_create_cgroup
We always end up doing these together, so just colocate them and require
manager state for unit_create_cgroup.
2018-12-03 14:37:38 +00:00
Lennart Poettering 67e2ea1542 cgroup: suffix unit file settings with "=" in log output
Let's follow our recommendations from CODING_STYLE and suffix unit file
settings with "=" everywhere.
2018-12-01 12:57:51 +01:00
Lennart Poettering be2c032781 core: don't try to write CPU quota and memory limit cgroup attrs on root cgroup
In the kernel sources attempts to write to either are refused with
EINVAL. Not sure why these attributes are exported anyway on cgroupsv1,
but this means we really should ignore them altogether.

This simplifies our code as this means cgroupsv1 is more alike cgroupsv2
in this regard.

Fixes: #10969
2018-12-01 12:57:51 +01:00
Lennart Poettering d5aecba6e0 cgroup: use device_path_parse_major_minor() also for block device paths
Not only when we populate the "devices" cgroup controller we need
major/minor numbers, but for the io/blkio one it's the same, hence let's
use the same logic for both.
2018-11-29 20:21:39 +01:00
Lennart Poettering 846b3bd61e stat-util: add new APIs device_path_make_{major_minor|canonical}() and device_path_parse_major_minor()
device_path_make_{major_minor|canonical)  generate device node paths
given a mode_t and a dev_t. We have similar code all over the place,
let's unify this in one place. The former will generate a "/dev/char/"
or "/dev/block" path, and never go to disk. The latter then goes to disk
and resolves that path to the actual path of the device node.

device_path_parse_major_minor() reverses device_path_make_major_minor(),
also withozut going to disk.

We have similar code doing something like this at various places, let's
unify this in a single set of functions. This also allows us to teach
them special tricks, for example handling of the
/run/systemd/inaccessible/{blk|chr} device nodes, which we use for
masking device nodes, and which do not exist in /dev/char/* and
/dev/block/*
2018-11-29 20:21:39 +01:00
Lennart Poettering 8e8b5d2e6d cgroups: beef up DeviceAllow= syntax a bit
Previously we'd allow pattern expressions such as "char-input" to match
all input devices. Internally, this would look up the right major to
test in /proc/devices. With this commit the syntax is slightly extended:

- "char-*" can be used to match any kind of character device, and
  similar "block-*. This expression would work previously already, but
  instead of actually installing a wildcard match it would install many
  individual matches for everything listed in /proc/devices.

- "char-<MAJOR>" with "<MAJOR>" being a numerical parameter works now
  too. This allows clients to install whitelist items by specifying the
  major directly.

The main reason to add these is to provide limited compat support for
clients that for some reason contain whitelists with major/minor numbers
(such as OCI containers).
2018-11-29 20:21:39 +01:00
Lennart Poettering 74c48bf5a8 core: add special handling for devices cgroup allow lists for /dev/block/* and /dev/char/* device nodes
This adds some code to hanlde /dev/block/* and /dev/char/* device node
paths specially: instead of actually stat()ing them we'll just parse the
major/minor name from the name. This is useful 'hack' to allow clients
to install whitelists for devices that don't actually have to exist.

Also, let's similarly handle /run/systemd/inaccessible/{blk|chr}. This
allows us to simplify our built-in default whitelist to not require a
"ignore_enoent" mode for these nodes.

In general we should be careful with hardcoding major/minor numbers, but
in this case this should safe.
2018-11-29 20:03:56 +01:00
Lennart Poettering 5af8805872 cgroup: drastically simplify caching of cgroups members mask
Previously we tried to be smart: when a new unit appeared and it only
added controllers to the cgroup mask we'd update the cached members mask
in all parents by ORing in the controller flags in their cached values.
Unfortunately this was quite broken, as we missed some conditions when
this cache had to be reset (for example, when a unit got unloaded),
moreover the optimization doesn't work when a controller is removed
anyway (as in that case there's no other way for the parent to iterate
though all children if any other, remaining child unit still needs it).
Hence, let's simplify the logic substantially: instead of updating the
cache on the right events (which we didn't get right), let's simply
invalidate the cache, and generate it lazily when we encounter it later.
This should actually result in better behaviour as we don't have to
calculate the new members mask for a whole subtree whever we have the
suspicion something changed, but can delay it to the point where we
actually need the members mask.

This allows us to simplify things quite a bit, which is good, since
validating this cache for correctness is hard enough.

Fixes: #9512
2018-11-23 13:41:37 +01:00
Lennart Poettering 8a0d538815 cgroup: extend comment on what unit_release_cgroup() is for 2018-11-23 13:41:37 +01:00
Lennart Poettering 1fd3a10c38 cgroup: extend reasons when we realize the enable mask
After creating a cgroup we need to initialize its
"cgroup.subtree_control" file with the controllers its children want to
use. Currently we do so whenever the mkdir() on the cgroup succeeded,
i.e. when we know the cgroup is "fresh". Let's update the condition
slightly that we also do so when internally we assume a cgroup doesn't
exist yet, even if it already does (maybe left-over from a previous
run).

This shouldn't change anything IRL but make things a bit more robust.
2018-11-23 13:41:37 +01:00
Lennart Poettering d5095dcd30 cgroup: tighten call that detects whether we need to realize a unit's cgroup a bit, and comment why 2018-11-23 13:41:37 +01:00
Lennart Poettering 27c4ed790a cgroup: simplify check whether it makes sense to realize a cgroup 2018-11-23 13:41:37 +01:00
Lennart Poettering e00068e71f cgroup: in unit_invalidate_cgroup() actually modify invalidation mask
Previously this would manipulate the realization mask for invalidating
the realization. This is a bit ugly though as the realization mask's
primary purpose to is to reflect in which hierarchies a cgroup currently
exists, and it's probably a good idea to keep that in sync with
realities.

We nowadays have the an explicit fields for invalidating cgroup
controller information, the "cgroup_invalidated_mask", let's use this
one instead.

The effect is pretty much the same, as the main consumer of these masks
(unit_has_mask_realize()) checks both anyway.
2018-11-23 13:41:37 +01:00