Commit graph

4881 commits

Author SHA1 Message Date
Chris Down 22bf131be2 cgroup: Support 0-value for memory protection directives
These make sense to be explicitly set at 0 (which has a different effect
than the default, since it can affect processing of `DefaultMemoryXXX`).

Without this, it's not easily possible to relinquish memory protection
for a subtree, which is not great.
2019-05-08 12:06:32 +01:00
Chris Down 7e7223b3d5 cgroup: Readd some plumbing for DefaultMemoryMin
Somehow these got lost in the previous PR, rendering DefaultMemoryMin
not very useful.
2019-05-08 12:06:32 +01:00
Lennart Poettering adb7b782f8
Merge pull request #12218 from keszybz/use-libmount-more
Use libmount more
2019-04-30 19:44:17 +02:00
Lennart Poettering 0892f3f999
Merge pull request #12420 from mrc0mmand/coccinelle-tweaks
Coccinelle improvements
2019-04-30 11:37:19 +02:00
Frantisek Sumsal ed0cb34682 tree-wide: code improvements suggested by Coccinelle 2019-04-30 09:39:07 +02:00
Ben Boeckel 5238e95759 codespell: fix spelling errors 2019-04-29 16:47:18 +02:00
Lennart Poettering d8974757c4
Merge pull request #12407 from keszybz/two-unrelated-cleanups
Two unrelated cleanups
2019-04-26 23:43:27 +02:00
Lennart Poettering 85318688cc chown-recursive: also check mode before we bypass 2019-04-26 08:31:08 +02:00
Zbigniew Jędrzejewski-Szmek c5b7ae0edb
Merge pull request #12074 from poettering/io-acct
expose IO stats on the bus and in "systemctl status" and "systemd-run --wait"
2019-04-25 11:59:37 +02:00
Zbigniew Jędrzejewski-Szmek c5322608a5 core: adjust unit_get_ancestor_memory_{low,min}() to work with units which don't have a CGroupContext
Coverity doesn't like the fact that unit_get_cgroup_context() returns NULL for
unit types that don't have a CGroupContext. We don't expect to call those
functions with such unit types, so this isn't an immediate problem, but we can
make things more robust by handling this case.

CID #1400683, #1400684.
2019-04-25 11:13:02 +02:00
Zbigniew Jędrzejewski-Szmek b6411f716c
Merge pull request #12332 from cdown/default_min
cgroup: Add support for propagation of memory.min
2019-04-25 11:06:45 +02:00
Jan Klötzke 99b43caf26 core: immediately trigger watchdog action on WATCHDOG=trigger
A service might be able to detect errors by itself that may require the
system to take the same action as if the service locked up. Add a
WATCHDOG=trigger state change notification to sd_notify() to let the
service manager know about the self-detected misery and instantly
trigger the configured watchdog behaviour.
2019-04-24 10:17:10 +02:00
Zbigniew Jędrzejewski-Szmek e2857b3d87 Add helper function for mnt_table_parse_{stream,mtab}
This wraps a few common steps. It is defined as inline function instead of in a
.c file to avoid having a .c file. With a .c file, we would have three choices:
- either link it into libshared, but then then libshared would have to be
  linked to libmount.
- or compile the .c file into each target separately. This has the disdvantage
  that configuration of every target has to be updated and stuff will be compiled
  multiple times anyway, which is not too different from keeping this in the
  header file.
- or create a new convenience library just for this. This also has the disadvantage
  that the every target would have to be updated, and a separate library for a
  10 line function seems overkill.

By keeping everything in a header file, we compile this a few times, but
otherwise it's the least painful option. The compiler can optimize most of the
function away, because it knows if 'source' is set or not.
2019-04-23 23:29:29 +02:00
Zbigniew Jędrzejewski-Szmek 13dcfe4661 shared/mount-util: convert to libmount
It seems better to use just a single parsing algorithm for /proc/self/mountinfo.

Also, unify the naming of variables in all places that use mnt_table_next_fs().
It makes it easier to compare the different call sites.
2019-04-23 23:29:29 +02:00
Anita Zhang 25cc30c4c8 core: support DisableControllers= for transient units 2019-04-22 11:52:08 -07:00
Chris Down 7ad5439e06 unit: Add DefaultMemoryMin 2019-04-16 18:45:04 +01:00
Chris Down 6264b85e92 cgroup: Create UNIT_DEFINE_ANCESTOR_MEMORY_LOOKUP
This is in preparation for creating unit_get_ancestor_memory_min.
2019-04-16 18:39:51 +01:00
Yu Watanabe dcab85be18 core: do not show TimeoutStopSec= in dump message if it is not set 2019-04-14 20:47:13 +09:00
Yu Watanabe 9c79f0e0a0 core: add assertion in two inline functions 2019-04-14 20:46:24 +09:00
Yu Watanabe 3bf0cb65f5 core: use BUS_DEFINE_PROPERTY_GET() macro at more places 2019-04-14 20:45:31 +09:00
Yu Watanabe 54c1a6ab8c core: change type of Service::timeout_abort_set to bool
Follow-up for dc653bf487 (#11211).
2019-04-14 20:13:47 +09:00
Jan Klötzke dc653bf487 service: handle abort stops with dedicated timeout
When shooting down a service with SIGABRT the user might want to have a
much longer stop timeout than on regular stops/shutdowns. Especially in
the face of short stop timeouts the time might not be sufficient to
write huge core dumps before the service is killed.

This commit adds a dedicated (Default)TimeoutAbortSec= timer that is
used when stopping a service via SIGABRT. In all other cases the
existing TimeoutStopSec= is used. The timer value is unset by default
to skip the special handling and use TimeoutStopSec= for state
'stop-watchdog' to keep the old behaviour.

If the service is in state 'stop-watchdog' and the service should be
stopped explicitly we still go to 'stop-sigterm' and re-apply the usual
TimeoutStopSec= timeout.
2019-04-12 17:32:52 +02:00
Chris Down c52db42b78 cgroup: Implement default propagation of MemoryLow with DefaultMemoryLow
In cgroup v2 we have protection tunables -- currently MemoryLow and
MemoryMin (there will be more in future for other resources, too). The
design of these protection tunables requires not only intermediate
cgroups to propagate protections, but also the units at the leaf of that
resource's operation to accept it (by setting MemoryLow or MemoryMin).

This makes sense from an low-level API design perspective, but it's a
good idea to also have a higher-level abstraction that can, by default,
propagate these resources to children recursively. In this patch, this
happens by having descendants set memory.low to N if their ancestor has
DefaultMemoryLow=N -- assuming they don't set a separate MemoryLow
value.

Any affected unit can opt out of this propagation by manually setting
`MemoryLow` to some value in its unit configuration. A unit can also
stop further propagation by setting `DefaultMemoryLow=` with no
argument. This removes further propagation in the subtree, but has no
effect on the unit itself (for that, use `MemoryLow=0`).

Our use case in production is simplifying the configuration of machines
which heavily rely on memory protection tunables, but currently require
tweaking a huge number of unit files to make that a reality. This
directive makes that significantly less fragile, and decreases the risk
of misconfiguration.

After this patch is merged, I will implement DefaultMemoryMin= using the
same principles.
2019-04-12 17:23:58 +02:00
Lennart Poettering bc40a20ebe core: include IO data in per-unit resource log msg 2019-04-12 14:25:44 +02:00
Lennart Poettering fbe14fc9a7 croup: expose IO accounting data per unit
This was the last kind of accounting still not exposed on for each unit.
Let's fix that.

Note that this is a relatively simplistic approach: we don't expose
per-device stats, but sum them all up, much like cgtop does. This kind
of metric is probably the most interesting for most usecases, and covers
the "systemctl status" output best. If we want per-device stats one day
we can of course always add that eventually.
2019-04-12 14:25:44 +02:00
Lennart Poettering 83f18c91d0 core: use string_table_lookup() at more places 2019-04-12 14:25:44 +02:00
Lennart Poettering 9b2559a13e core: add new call unit_reset_accounting()
It's a simple wrapper for resetting both IP and CPU accounting in one
go.

This will become particularly useful when we also needs this to reset IO
accounting (to be added in a later commit).
2019-04-12 14:25:44 +02:00
Lennart Poettering cc6625212f core: no need to initialize ip_accounting twice 2019-04-12 14:25:44 +02:00
Lennart Poettering 0bbff7d638 cgroup: get rid of a local variable 2019-04-12 14:25:44 +02:00
Lennart Poettering 3661dc349e
Merge pull request #12217 from keszybz/unlocked-operations
Refactor how we do unlocked file operations
2019-04-12 13:51:53 +02:00
Zbigniew Jędrzejewski-Szmek 2fe21124a6 Add open_memstream_unlocked() wrapper 2019-04-12 11:44:57 +02:00
Zbigniew Jędrzejewski-Szmek b636d78aee core/smack-setup: add helper function for openat+fdopen
Unlocked operations are used in all three places. I don't see why just one was
special.

This also improves logging, since we don't just log the final component of the
path, but the full name.
2019-04-12 11:44:57 +02:00
Zbigniew Jędrzejewski-Szmek 41f6e627d7 Make fopen_temporary and fopen_temporary_label unlocked
This is partially a refactoring, but also makes many more places use
unlocked operations implicitly, i.e. all users of fopen_temporary().
AFAICT, the uses are always for short-lived files which are not shared
externally, and are just used within the same context. Locking is not
necessary.
2019-04-12 11:44:56 +02:00
Zbigniew Jędrzejewski-Szmek 17e4b07088 core: vodify one more call to mkdir
CID #1400460.
2019-04-12 09:05:02 +02:00
Yu Watanabe 01234e1fe7 tree-wide: drop several missing_*.h and import relevant headers from kernel-5.0 2019-04-11 19:00:37 +02:00
Lennart Poettering aa46c28418
Merge pull request #12153 from benjarobin/killall-show-not-killed
shutdown/killall: Show in the console the processes not yet killed
2019-04-11 18:58:43 +02:00
Lennart Poettering 54f802ff8a
Merge pull request #12037 from poettering/oom-state
add cgroupv2 oom killer event handling to service management
2019-04-11 18:57:47 +02:00
Lennart Poettering 4ff9bc2ea6 tree-wide: port users over to use new ERRNO_IS_ACCEPT_AGAIN() call 2019-04-10 22:11:18 +02:00
Benjamin Robin 763e7b5da6 core/killall: Add documentation about broadcast_signal() 2019-04-10 19:30:38 +02:00
Benjamin Robin 2c32f4f47d core/killall: Log the process names not killed after 10s 2019-04-10 19:27:38 +02:00
Lennart Poettering afcfaa695c core: implement OOMPolicy= and watch cgroups for OOM killings
This adds a new per-service OOMPolicy= (along with a global
DefaultOOMPolicy=) that controls what to do if a process of the service
is killed by the kernel's OOM killer. It has three different values:
"continue" (old behaviour), "stop" (terminate the service), "kill" (let
the kernel kill all the service's processes).

On top of that, track OOM killer events per unit: generate a per-unit
structured, recognizable log message when we see an OOM killer event,
and put the service in a failure state if an OOM killer event was seen
and the selected policy was not "continue". A new "result" is defined
for this case: "oom-kill".

All of this relies on new cgroupv2 kernel functionality: the
"memory.events" notification interface and the "memory.oom.group"
attribute (which makes the kernel kill all cgroup processes
automatically).
2019-04-09 11:17:58 +02:00
Lennart Poettering a5b5aece01 service: beautify debug log message a bit 2019-04-09 11:17:58 +02:00
Lennart Poettering 0bb814c2c2 core: rename cgroup_inotify_wd → cgroup_control_inotify_wd
Let's rename the .cgroup_inotify_wd field of the Unit object to
.cgroup_control_inotify_wd. Let's similarly rename the hashmap
.cgroup_inotify_wd_unit of the Manager object to
.cgroup_control_inotify_wd_unit.

Why? As preparation for a later commit that allows us to watch the
"memory.events" cgroup attribute file in addition to the "cgroup.events"
file we already watch with the fields above. In that later commit we'll
add new fields "cgroup_memory_inotify_wd" to Unit and
"cgroup_memory_inotify_wd_unit" to Manager, that are used to watch these
other events file.

No change in behaviour. Just some renaming.
2019-04-09 11:17:57 +02:00
Lennart Poettering 5210387ea6 core: check for redundant operation before doing allocation 2019-04-09 11:17:57 +02:00
Lennart Poettering cbe83389d5 core: rearrange cgroup empty events a bit
So far the priorities for cgroup empty event handling were pretty weird.
The raw events (on cgroupsv2 from inotify, on cgroupsv1 from the agent
dgram socket) where scheduled at a lower priority than the cgroup empty
queue dispatcher. Let's swap that and ensure that we can coalesce events
more agressively: let's process the raw events at higher priority than
the cgroup empty event (which remains at the same prio).
2019-04-09 11:17:57 +02:00
Zbigniew Jędrzejewski-Szmek 9d1b2b2252 pid1,shutdown: do not cunescape paths from libmount
The test added in previous commit shows that libmount does the unescaping
internally.
2019-04-09 09:07:40 +02:00
Benjamin Robin a012f9f7cf core/killall: Propagate errors and return the number of process left 2019-04-08 19:41:16 +02:00
Zbigniew Jędrzejewski-Szmek fb36b1339b shared: add a single definition of libmount cleanup functions
Use a trivial header file to share mnt_free_tablep and mnt_free_iterp.
It would be nicer put this in mount-util.h, but libmount.h is not in the
default include path, and the build system would have to be adjusted to pass
pkg-config include path in various places, and it's just not worth the trouble.
A separate header file works nicely.
2019-04-05 10:18:21 +02:00
Zbigniew Jędrzejewski-Szmek 58f6ab4454 pid1: pass unit name to seccomp parser when we have no file location
Building on previous commit, let's pass the unit name when parsing
dbus message or builtin whitelist, which is better than nothing.

seccomp_parse_syscall_filter() is not needed anymore, so it is removed,
and seccomp_parse_syscall_filter_full() is renamed to take its place.
2019-04-03 09:17:42 +02:00
Zbigniew Jędrzejewski-Szmek e7ccdfa809 core: use a temporary variable for calculation of seccomp flags
I think it is easier to read this way.
2019-04-03 08:56:06 +02:00