Commit graph

165 commits

Author SHA1 Message Date
Chris Down c52db42b78 cgroup: Implement default propagation of MemoryLow with DefaultMemoryLow
In cgroup v2 we have protection tunables -- currently MemoryLow and
MemoryMin (there will be more in future for other resources, too). The
design of these protection tunables requires not only intermediate
cgroups to propagate protections, but also the units at the leaf of that
resource's operation to accept it (by setting MemoryLow or MemoryMin).

This makes sense from an low-level API design perspective, but it's a
good idea to also have a higher-level abstraction that can, by default,
propagate these resources to children recursively. In this patch, this
happens by having descendants set memory.low to N if their ancestor has
DefaultMemoryLow=N -- assuming they don't set a separate MemoryLow
value.

Any affected unit can opt out of this propagation by manually setting
`MemoryLow` to some value in its unit configuration. A unit can also
stop further propagation by setting `DefaultMemoryLow=` with no
argument. This removes further propagation in the subtree, but has no
effect on the unit itself (for that, use `MemoryLow=0`).

Our use case in production is simplifying the configuration of machines
which heavily rely on memory protection tunables, but currently require
tweaking a huge number of unit files to make that a reality. This
directive makes that significantly less fragile, and decreases the risk
of misconfiguration.

After this patch is merged, I will implement DefaultMemoryMin= using the
same principles.
2019-04-12 17:23:58 +02:00
Lennart Poettering afcfaa695c core: implement OOMPolicy= and watch cgroups for OOM killings
This adds a new per-service OOMPolicy= (along with a global
DefaultOOMPolicy=) that controls what to do if a process of the service
is killed by the kernel's OOM killer. It has three different values:
"continue" (old behaviour), "stop" (terminate the service), "kill" (let
the kernel kill all the service's processes).

On top of that, track OOM killer events per unit: generate a per-unit
structured, recognizable log message when we see an OOM killer event,
and put the service in a failure state if an OOM killer event was seen
and the selected policy was not "continue". A new "result" is defined
for this case: "oom-kill".

All of this relies on new cgroupv2 kernel functionality: the
"memory.events" notification interface and the "memory.oom.group"
attribute (which makes the kernel kill all cgroup processes
automatically).
2019-04-09 11:17:58 +02:00
Lennart Poettering f69567cbe2 core: expose SUID/SGID restriction as new unit setting RestrictSUIDSGID= 2019-04-02 16:56:48 +02:00
Lennart Poettering efebb613c7 core: optionally, trigger .timer units on timezone and clock changes
Fixes: #6228
2019-04-02 08:20:10 +02:00
Lennart Poettering 25b1d72dcc bus-unit-util: split out code that shows a unit's process tree
The code is complex enough to deserve its own .c file. Let's split this
out.
2019-03-13 17:41:41 +01:00
Lennart Poettering e45c81b8bc shared: split out code to wait for jobs to complet into its own source file
It's complex enough and quite a few functions. Let's hence split this
out.

No code change, just some rearranging of source files.
2019-03-13 17:39:24 +01:00
Lennart Poettering db7091dca2 bus-unit-util: insist on full initialization 2019-03-13 17:38:43 +01:00
Lennart Poettering 2cdb2c2dc3 bus-unit-util: never call into log_job_error_with_service_result() if we are not a service
The call can't handle non-services, hence don't bother.
2019-03-13 17:38:43 +01:00
Lennart Poettering 61e209eb3a bus-unit-util: move explanations array to inner scope
It's specific to service units, hence let's minimize the scope since it
has no validity outside of the log message generation for service units.
2019-03-13 17:38:43 +01:00
Lennart Poettering 190c22189d bus-unit-util: use structure initialization 2019-03-13 17:38:43 +01:00
Lennart Poettering 1c070ea086 bus-unit-util: use free_and_strdup() where we can 2019-03-13 17:38:43 +01:00
Lennart Poettering 760877e90c util: split out sorting related calls to new sort-util.[ch] 2019-03-13 12:16:43 +01:00
Lennart Poettering 4ad9fb38a9 run: make sure NetworkNamespacePath= can be used on the systemd-run cmdline 2019-03-07 17:47:29 +01:00
Lennart Poettering eb5149ba74
Merge pull request #11682 from topimiettinen/private-utsname
core: ProtectHostname feature
2019-02-20 14:12:15 +01:00
Topi Miettinen aecd5ac621 core: ProtectHostname= feature
Let services use a private UTS namespace. In addition, a seccomp filter is
installed on set{host,domain}name and a ro bind mounts on
/proc/sys/kernel/{host,domain}name.
2019-02-20 10:50:44 +02:00
Filipe Brandenburger 10f2864111 core: add CPUQuotaPeriodSec=
This new setting allows configuration of CFS period on the CPU cgroup, instead
of using a hardcoded default of 100ms.

Tested:
- Legacy cgroup + Unified cgroup
- systemctl set-property
- systemctl show
- Confirmed that the cgroup settings (such as cpu.cfs_period_ns) were set
  appropriately, including updating the CPU quota (cpu.cfs_quota_ns) when
  CPUQuotaPeriodSec= is updated.
- Checked that clamping works properly when either period or (quota * period)
  are below the resolution of 1ms, or if period is above the max of 1s.
2019-02-14 11:04:42 -08:00
Topi Miettinen a1e92eee3e Remove 'inline' attributes from static functions in .c files (#11426)
Let the compiler perform inlining (see #11397).
2019-01-15 08:12:28 +01:00
Lennart Poettering 9a6f746fb6 locale-util: prefix special glyph enum values with SPECIAL_GLYPH_
This has been irritating me for quite a while: let's prefix these enum
values with a common prefix, like we do for almost all other enums.

No change in behaviour, just some renaming.
2018-12-14 08:22:54 +01:00
Yu Watanabe 893829359a nsflsgs: drop missing.h and use missing_sched.h 2018-12-06 13:31:16 +01:00
Yu Watanabe d2b42d63c4 core,run: make SocketProtocol= accept protocol name in upper case an protocol number 2018-12-02 06:13:47 +01:00
Yu Watanabe da96ad5ae2 util: rename socket_protocol_{from,to}_name() to ip_protocol_{from,to}_name() 2018-12-02 05:48:27 +01:00
Zbigniew Jędrzejewski-Szmek 049af8ad0c Split out part of mount-util.c into mountpoint-util.c
The idea is that anything which is related to actually manipulating mounts is
in mount-util.c, but functions for mountpoint introspection are moved to the
new file. Anything which requires libmount must be in mount-util.c.

This was supposed to be a preparation for further changes, with no functional
difference, but it results in a significant change in linkage:

$ ldd build/libnss_*.so.2
(before)
build/libnss_myhostname.so.2:
	linux-vdso.so.1 (0x00007fff77bf5000)
	librt.so.1 => /lib64/librt.so.1 (0x00007f4bbb7b2000)
	libmount.so.1 => /lib64/libmount.so.1 (0x00007f4bbb755000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f4bbb734000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f4bbb56e000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f4bbb8c1000)
	libblkid.so.1 => /lib64/libblkid.so.1 (0x00007f4bbb51b000)
	libuuid.so.1 => /lib64/libuuid.so.1 (0x00007f4bbb512000)
	libselinux.so.1 => /lib64/libselinux.so.1 (0x00007f4bbb4e3000)
	libpcre2-8.so.0 => /lib64/libpcre2-8.so.0 (0x00007f4bbb45e000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007f4bbb458000)
build/libnss_mymachines.so.2:
	linux-vdso.so.1 (0x00007ffc19cc0000)
	librt.so.1 => /lib64/librt.so.1 (0x00007fdecb74b000)
	libcap.so.2 => /lib64/libcap.so.2 (0x00007fdecb744000)
	libmount.so.1 => /lib64/libmount.so.1 (0x00007fdecb6e7000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fdecb6c6000)
	libc.so.6 => /lib64/libc.so.6 (0x00007fdecb500000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fdecb8a9000)
	libblkid.so.1 => /lib64/libblkid.so.1 (0x00007fdecb4ad000)
	libuuid.so.1 => /lib64/libuuid.so.1 (0x00007fdecb4a2000)
	libselinux.so.1 => /lib64/libselinux.so.1 (0x00007fdecb475000)
	libpcre2-8.so.0 => /lib64/libpcre2-8.so.0 (0x00007fdecb3f0000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007fdecb3ea000)
build/libnss_resolve.so.2:
	linux-vdso.so.1 (0x00007ffe8ef8e000)
	librt.so.1 => /lib64/librt.so.1 (0x00007fcf314bd000)
	libcap.so.2 => /lib64/libcap.so.2 (0x00007fcf314b6000)
	libmount.so.1 => /lib64/libmount.so.1 (0x00007fcf31459000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fcf31438000)
	libc.so.6 => /lib64/libc.so.6 (0x00007fcf31272000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fcf31615000)
	libblkid.so.1 => /lib64/libblkid.so.1 (0x00007fcf3121f000)
	libuuid.so.1 => /lib64/libuuid.so.1 (0x00007fcf31214000)
	libselinux.so.1 => /lib64/libselinux.so.1 (0x00007fcf311e7000)
	libpcre2-8.so.0 => /lib64/libpcre2-8.so.0 (0x00007fcf31162000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007fcf3115c000)
build/libnss_systemd.so.2:
	linux-vdso.so.1 (0x00007ffda6d17000)
	librt.so.1 => /lib64/librt.so.1 (0x00007f610b83c000)
	libcap.so.2 => /lib64/libcap.so.2 (0x00007f610b835000)
	libmount.so.1 => /lib64/libmount.so.1 (0x00007f610b7d8000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f610b7b7000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f610b5f1000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f610b995000)
	libblkid.so.1 => /lib64/libblkid.so.1 (0x00007f610b59e000)
	libuuid.so.1 => /lib64/libuuid.so.1 (0x00007f610b593000)
	libselinux.so.1 => /lib64/libselinux.so.1 (0x00007f610b566000)
	libpcre2-8.so.0 => /lib64/libpcre2-8.so.0 (0x00007f610b4e1000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f610b4db000)

(after)
build/libnss_myhostname.so.2:
	linux-vdso.so.1 (0x00007fff0b5e2000)
	librt.so.1 => /lib64/librt.so.1 (0x00007fde0c328000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fde0c307000)
	libc.so.6 => /lib64/libc.so.6 (0x00007fde0c141000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fde0c435000)
build/libnss_mymachines.so.2:
	linux-vdso.so.1 (0x00007ffdc30a7000)
	librt.so.1 => /lib64/librt.so.1 (0x00007f06ecabb000)
	libcap.so.2 => /lib64/libcap.so.2 (0x00007f06ecab4000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f06eca93000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f06ec8cd000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f06ecc15000)
build/libnss_resolve.so.2:
	linux-vdso.so.1 (0x00007ffe95747000)
	librt.so.1 => /lib64/librt.so.1 (0x00007fa56a80f000)
	libcap.so.2 => /lib64/libcap.so.2 (0x00007fa56a808000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fa56a7e7000)
	libc.so.6 => /lib64/libc.so.6 (0x00007fa56a621000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fa56a964000)
build/libnss_systemd.so.2:
	linux-vdso.so.1 (0x00007ffe67b51000)
	librt.so.1 => /lib64/librt.so.1 (0x00007ffb32113000)
	libcap.so.2 => /lib64/libcap.so.2 (0x00007ffb3210c000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007ffb320eb000)
	libc.so.6 => /lib64/libc.so.6 (0x00007ffb31f25000)
	/lib64/ld-linux-x86-64.so.2 (0x00007ffb3226a000)

I don't quite understand what is going on here, but let's not be too picky.
2018-11-29 21:03:44 +01:00
Zbigniew Jędrzejewski-Szmek 8b4e51a60e
Merge pull request #10797 from poettering/run-generator
add new "systemd-run-generator" for running arbitrary commands from the kernel command line as system services using the "systemd.run=" kernel command line switch
2018-11-28 22:40:55 +01:00
Lennart Poettering 8d33232ef1 bus-unit-util: properly accept StandardOutput=append:… settings 2018-11-27 10:06:51 +01:00
Lennart Poettering 7af67e9a8b core: allow to set exit status when using SuccessAction=/FailureAction=exit in units
This adds SuccessActionExitStatus= and FailureActionExitStatus= that may
be used to configure the exit status to propagate in when
SuccessAction=exit or FailureAction=exit is used.

When not specified let's also propagate the exit status of the main
process we fork off for the unit.
2018-11-27 09:44:40 +01:00
Zbigniew Jędrzejewski-Szmek baaa35ad70 coccinelle: make use of SYNTHETIC_ERRNO
Ideally, coccinelle would strip unnecessary braces too. But I do not see any
option in coccinelle for this, so instead, I edited the patch text using
search&replace to remove the braces. Unfortunately this is not fully automatic,
in particular it didn't deal well with if-else-if-else blocks and ifdefs, so
there is an increased likelikehood be some bugs in such spots.

I also removed part of the patch that coccinelle generated for udev, where we
returns -1 for failure. This should be fixed independently.
2018-11-22 10:54:38 +01:00
Yu Watanabe 691d6f6d76 bus-unit-util: use streq() instead of STR_IN_SET()
Follow-up for 90fc172e19 (#10308).
2018-10-18 13:46:45 +02:00
Anita Zhang 90fc172e19 core: implement per unit journal rate limiting
Add LogRateLimitIntervalSec= and LogRateLimitBurst= options for
services. If provided, these values get passed to the journald
client context, and those values are used in the rate limiting
function in the journal over the the journald.conf values.

Part of #10230
2018-10-18 09:56:20 +02:00
Ronny Chevalier afc1feaeba bus-unit-util: fix parsing of IPAddress{Allow,Deny}
While the config parser correctly handles the case of multiple IPs,
bus_append_cgroup_property was only parsing one IP,
and it would fail with "Failed to parse IP address prefix" when given
a list of IPs.
2018-10-02 15:46:15 +02:00
Anita Zhang c87700a133 Make Watchdog Signal Configurable
Allows configuring the watchdog signal (with a default of SIGABRT).
This allows an alternative to SIGABRT when coredumps are not desirable.

Appropriate references to SIGABRT or aborting were renamed to reflect
more liberal watchdog signals.

Closes #8658
2018-09-26 16:14:29 +02:00
Yu Watanabe 93bab28895 tree-wide: use typesafe_qsort() 2018-09-19 08:02:52 +09:00
Tejun Heo 6ae4283cb1 core: add IODeviceLatencyTargetSec
This adds support for the following proposed latency based IO control
mechanism.

  https://lkml.org/lkml/2018/6/5/428
2018-08-22 16:46:18 +02:00
Zbigniew Jędrzejewski-Szmek 54fe2ce1b9
Merge pull request #9504 from poettering/nss-deadlock
some nss deadlock love
2018-07-26 10:16:25 +02:00
Lennart Poettering f806dfd345 tree-wide: increase granularity of percent specifications all over the place to permille
We so far had various placed we'd parse percentages with
parse_percent(). Let's make them use parse_permille() instead, which is
downward compatible (as it also parses percent values), and increases
the granularity a bit. Given that on the wire we usually normalize
relative specifications to something like UINT32_MAX anyway changing
from base-100 to base-1000 calculations can be done easily without
breaking compat.

This commit doesn't document this change in the man pages. While
allowing more precise specifcations permille is not as commonly
understood as perent I guess, hence let's keep this out of the docs for
now.
2018-07-25 16:14:45 +02:00
Jon Ringle fbb48d4c66 Make final kill signal configurable
Usecase is to allow changing the final kill from SIGKILL to SIGQUIT which
should create a core dump useful for debugging why the service didn't stop
with the SIGTERM
2018-07-23 13:44:54 +02:00
Lennart Poettering 79d53eb8f7 bus-unit-util: tiny coding style fix 2018-07-20 16:57:35 +02:00
Tejun Heo 4842263577 core: add MemoryMin
The kernel added support for a new cgroup memory controller knob memory.min in
bf8d5d52ffe8 ("memcg: introduce memory.min") which was merged during v4.18
merge window.

Add MemoryMin to support memory.min.
2018-07-12 08:21:43 +02:00
Ronny Chevalier 98008caa94 shared: do not include ~ when appending syscall filters property
The method already uses a boolean argument to determine whether it is in
whitelist mode or not. The code that will parse the string of filters
does not expect the ~, since it already has the boolean argument. Thus,
it will fail to parse the list of filters.
2018-06-18 13:12:20 +02:00
Lennart Poettering 0c69794138 tree-wide: remove Lennart's copyright lines
These lines are generally out-of-date, incomplete and unnecessary. With
SPDX and git repository much more accurate and fine grained information
about licensing and authorship is available, hence let's drop the
per-file copyright notice. Of course, removing copyright lines of others
is problematic, hence this commit only removes my own lines and leaves
all others untouched. It might be nicer if sooner or later those could
go away too, making git the only and accurate source of authorship
information.
2018-06-14 10:20:20 +02:00
Lennart Poettering 818bf54632 tree-wide: drop 'This file is part of systemd' blurb
This part of the copyright blurb stems from the GPL use recommendations:

https://www.gnu.org/licenses/gpl-howto.en.html

The concept appears to originate in times where version control was per
file, instead of per tree, and was a way to glue the files together.
Ultimately, we nowadays don't live in that world anymore, and this
information is entirely useless anyway, as people are very welcome to
copy these files into any projects they like, and they shouldn't have to
change bits that are part of our copyright header for that.

hence, let's just get rid of this old cruft, and shorten our codebase a
bit.
2018-06-14 10:20:20 +02:00
Lennart Poettering 228af36fff core: add new PrivateMounts= unit setting
This new setting is supposed to be useful in most cases where
"MountFlags=slave" is currently used, i.e. as an explicit way to run a
service in its own mount namespace and decouple propagation from all
mounts of the new mount namespace towards the host.

The effect of MountFlags=slave and PrivateMounts=yes is mostly the same,
as both cause a CLONE_NEWNS namespace to be opened, and both will result
in all mounts within it to be mounted MS_SLAVE. The difference is mostly
on the conceptual/philosophical level: configuring the propagation mode
is nothing people should have to think about, in particular as the
matter is not precisely easyto grok. Moreover, MountFlags= allows configuration
of "private" and "slave" modes which don't really make much sense to use
in real-life and are quite confusing. In particular PrivateMounts=private means
mounts made on the host stay pinned for good by the service which is
particularly nasty for removable media mount. And PrivateMounts=shared
is in most ways a NOP when used a alone...

The main technical difference between setting only MountFlags=slave or
only PrivateMounts=yes in a unit file is that the former remounts all
mounts to MS_SLAVE and leaves them there, while that latter remounts
them to MS_SHARED again right after. The latter is generally a nicer
approach, since it disables propagation, while MS_SHARED is afterwards
in effect, which is really nice as that means further namespacing down
the tree will get MS_SHARED logic by default and we unify how
applications see our mounts as we always pass them as MS_SHARED
regardless whether any mount namespacing is used or not.

The effect of PrivateMounts=yes was implied already by all the other
mount namespacing options. With this new option we add an explicit knob
for it, to request it without any other option used as well.

See: #4393
2018-06-12 16:12:10 +02:00
Lennart Poettering b8b846d7b4 tree-wide: fix a number of log calls that use %m but have no errno set
This is mostly fall-out from d1a1f0aaf0,
however some cases are older bugs.

There might be more issues lurking, this was a simple grep for "%m"
across the tree, with all lines removed that mention "errno" at all.
2018-06-07 15:29:17 +02:00
Lennart Poettering cdc0f9be92
Merge pull request #8817 from yuwata/cleanup-nsflags
core: allow to specify RestrictNamespaces= multiple times
2018-05-24 16:49:13 +02:00
Lennart Poettering 6550c24c7f rlimit-util: rework rlimit_{from|to}_string() to work without "Limit" prefix
let's make the call more generic, so that we can also easily use it for
parsing "RLIMIT_xyz" style constants.
2018-05-17 20:36:52 +02:00
Yu Watanabe aa9d574de9 load-fragment: allow to specify RestrictNamespaces= multiple times
If multiple RestrictNamespaces= settings are set, then merge the settings.
This also drops supporting "~yes" and "~no".
2018-05-05 11:07:37 +09:00
Yu Watanabe 86c2a9f1c2 nsflsgs: drop namespace_flag_{from,to}_string()
This also drops namespace_flag_to_string_many_with_check(), and
renames namespace_flag_{from,to}_string_many() to
namespace_flags_{from,to}_string().
2018-05-05 11:07:37 +09:00
Yu Watanabe 29a3db75fd util: rename signal_from_string_try_harder() to signal_from_string()
Also this makes the new `signal_from_string()` function reject
e.g, `SIG3` or `SIG+5`.
2018-05-03 16:52:49 +09:00
Lennart Poettering d4fd1cf208 core: enforce that scope units can be started only once
Scope units are populated from PIDs specified by the bus client. We do
that when a scope is started. We really shouldn't allow scopes to be
started multiple times, as the PIDs then might be heavily out of date.
Moreover, clients should have the guarantee that any scope they allocate
has a clear runtime cycle which is not repetitive.
2018-04-27 21:52:45 +02:00
Lennart Poettering da6053d0a7 tree-wide: be more careful with the type of array sizes
Previously we were a bit sloppy with the index and size types of arrays,
we'd regularly use unsigned. While I don't think this ever resulted in
real issues I think we should be more careful there and follow a
stricter regime: unless there's a strong reason not to use size_t for
array sizes and indexes, size_t it should be. Any allocations we do
ultimately will use size_t anyway, and converting forth and back between
unsigned and size_t will always be a source of problems.

Note that on 32bit machines "unsigned" and "size_t" are equivalent, and
on 64bit machines our arrays shouldn't grow that large anyway, and if
they do we have a problem, however that kind of overly large allocation
we have protections for usually, but for overflows we do not have that
so much, hence let's add it.

So yeah, it's a story of the current code being already "good enough",
but I think some extra type hygiene is better.

This patch tries to be comprehensive, but it probably isn't and I missed
a few cases. But I guess we can cover that later as we notice it. Among
smaller fixes, this changes:

1. strv_length()' return type becomes size_t

2. the unit file changes array size becomes size_t

3. DNS answer and query array sizes become size_t

Fixes: https://bugs.freedesktop.org/show_bug.cgi?id=76745
2018-04-27 14:29:06 +02:00
Zbigniew Jędrzejewski-Szmek 9169e4c7ba Revert "bus-unit-util: fix bus_wait_for_jobs() debug output (#8760)"
This reverts commit d6b87637c5.

Let's try a different approach.
2018-04-24 14:09:53 +02:00