Commit graph

468 commits

Author SHA1 Message Date
Lennart Poettering afcfaa695c core: implement OOMPolicy= and watch cgroups for OOM killings
This adds a new per-service OOMPolicy= (along with a global
DefaultOOMPolicy=) that controls what to do if a process of the service
is killed by the kernel's OOM killer. It has three different values:
"continue" (old behaviour), "stop" (terminate the service), "kill" (let
the kernel kill all the service's processes).

On top of that, track OOM killer events per unit: generate a per-unit
structured, recognizable log message when we see an OOM killer event,
and put the service in a failure state if an OOM killer event was seen
and the selected policy was not "continue". A new "result" is defined
for this case: "oom-kill".

All of this relies on new cgroupv2 kernel functionality: the
"memory.events" notification interface and the "memory.oom.group"
attribute (which makes the kernel kill all cgroup processes
automatically).
2019-04-09 11:17:58 +02:00
Zbigniew Jędrzejewski-Szmek 58f6ab4454 pid1: pass unit name to seccomp parser when we have no file location
Building on previous commit, let's pass the unit name when parsing
dbus message or builtin whitelist, which is better than nothing.

seccomp_parse_syscall_filter() is not needed anymore, so it is removed,
and seccomp_parse_syscall_filter_full() is renamed to take its place.
2019-04-03 09:17:42 +02:00
Lennart Poettering dc44c96d97 core: pass parse error to log functions when parsing timer expressions 2019-04-01 18:25:43 +02:00
Lennart Poettering 25a04ae55e core: simply timer expression parsing by using ".ltype" field of conf-parser logic
No change of behaviour. Let's just not parse the lvalue all the time
with timer_base_from_string() if we can already pass it in parsed.
2019-04-01 18:25:43 +02:00
Zbigniew Jędrzejewski-Szmek 983616735e
Merge pull request #12137 from poettering/socket-var-run
warn about sockets in /var/run/ too
2019-03-29 15:00:25 +01:00
Lennart Poettering 4a66b5c9bf core: complain and correct /var/run/ → /run/ for listening sockets
We already do that for PIDFile= paths, and for tmpfiles.d/ snippets,
let's also do this for .socket paths.
2019-03-28 16:59:57 +01:00
Lennart Poettering 7d2c9c6b50 load-fragment: use TAKE_PTR() where we can 2019-03-28 16:46:27 +01:00
Lennart Poettering acd142af79 core: break overly long line 2019-03-28 12:09:38 +01:00
Lennart Poettering 2f6b9110fc core: parse '@default' seccomp group permissively
We are about to add system calls (rseq()) not available on old
libseccomp/old kernels, and hence we need to be permissive when parsing
our definitions.
2019-03-28 12:09:38 +01:00
Lennart Poettering d8b4d14df4 util: split out nulstr related stuff to nulstr-util.[ch] 2019-03-14 13:25:52 +01:00
Lennart Poettering eefc66aa8f util: split out some stuff into a new file limits-util.[ch] 2019-03-13 12:16:43 +01:00
Lennart Poettering 55dadc5c57 core: warn if people use the undocumented/depreacted ConditionNull=
Triggered by:

https://github.com/systemd/systemd/issues/11812
2019-03-05 13:54:20 +01:00
Anita Zhang 7ca69792e5 core: add ':' prefix to ExecXYZ= skip env var substitution 2019-02-20 17:58:14 +01:00
Filipe Brandenburger 10f2864111 core: add CPUQuotaPeriodSec=
This new setting allows configuration of CFS period on the CPU cgroup, instead
of using a hardcoded default of 100ms.

Tested:
- Legacy cgroup + Unified cgroup
- systemctl set-property
- systemctl show
- Confirmed that the cgroup settings (such as cpu.cfs_period_ns) were set
  appropriately, including updating the CPU quota (cpu.cfs_quota_ns) when
  CPUQuotaPeriodSec= is updated.
- Checked that clamping works properly when either period or (quota * period)
  are below the resolution of 1ms, or if period is above the max of 1s.
2019-02-14 11:04:42 -08:00
Yu Watanabe b8055c05e2 core/load-fragment: empty assignment to PIDFile= resets the value
Follow-up for a9353a5c5b.
2019-02-06 17:58:24 +01:00
Chris Lamb 4605de118d Correct more spelling errors. 2019-01-23 23:34:52 +01:00
Khem Raj baa162cecd core: Fix use after free case in load_from_path()
ensure that mfree() on filename is called after the logging function
which uses the string pointed by filename

Signed-off-by: Khem Raj <raj.khem@gmail.com>
2018-12-16 22:02:00 -08:00
Chris Down e92aaed30e tree-wide: Remove O_CLOEXEC from fdopen
fdopen doesn't accept "e", it's ignored. Let's not mislead people into
believing that it actually sets O_CLOEXEC.

From `man 3 fdopen`:

> e (since glibc 2.7):
> Open the file with the O_CLOEXEC flag. See open(2) for more information. This flag is ignored for fdopen()

As mentioned by @jlebon in #11131.
2018-12-12 20:47:40 +01:00
Yu Watanabe 3843e8260c missing: rename securebits.h to missing_securebits.h 2018-12-04 07:49:24 +01:00
Chris Down c72703e26d cgroup: Add DisableControllers= directive to disable controller in subtree
Some controllers (like the CPU controller) have a performance cost that
is non-trivial on certain workloads. While this can be mitigated and
improved to an extent, there will for some controllers always be some
overheads associated with the benefits gained from the controller.
Inside Facebook, the fix applied has been to disable the CPU controller
forcibly with `cgroup_disable=cpu` on the kernel command line.

This presents a problem: to disable or reenable the controller, a reboot
is required, but this is quite cumbersome and slow to do for many
thousands of machines, especially machines where disabling/enabling a
stateful service on a machine is a matter of several minutes.

Currently systemd provides some configuration knobs for these in the
form of `[Default]CPUAccounting`, `[Default]MemoryAccounting`, and the
like. The limitation of these is that Default*Accounting is overrideable
by individual services, of which any one could decide to reenable a
controller within the hierarchy at any point just by using a controller
feature implicitly (eg. `CPUWeight`), even if the use of that CPU
feature could just be opportunistic. Since many services are provided by
the distribution, or by upstream teams at a particular organisation,
it's not a sustainable solution to simply try to find and remove
offending directives from these units.

This commit presents a more direct solution -- a DisableControllers=
directive that forcibly disallows a controller from being enabled within
a subtree.
2018-12-03 15:40:31 +00:00
Yu Watanabe d2b42d63c4 core,run: make SocketProtocol= accept protocol name in upper case an protocol number 2018-12-02 06:13:47 +01:00
Yu Watanabe da96ad5ae2 util: rename socket_protocol_{from,to}_name() to ip_protocol_{from,to}_name() 2018-12-02 05:48:27 +01:00
Zbigniew Jędrzejewski-Szmek 049af8ad0c Split out part of mount-util.c into mountpoint-util.c
The idea is that anything which is related to actually manipulating mounts is
in mount-util.c, but functions for mountpoint introspection are moved to the
new file. Anything which requires libmount must be in mount-util.c.

This was supposed to be a preparation for further changes, with no functional
difference, but it results in a significant change in linkage:

$ ldd build/libnss_*.so.2
(before)
build/libnss_myhostname.so.2:
	linux-vdso.so.1 (0x00007fff77bf5000)
	librt.so.1 => /lib64/librt.so.1 (0x00007f4bbb7b2000)
	libmount.so.1 => /lib64/libmount.so.1 (0x00007f4bbb755000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f4bbb734000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f4bbb56e000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f4bbb8c1000)
	libblkid.so.1 => /lib64/libblkid.so.1 (0x00007f4bbb51b000)
	libuuid.so.1 => /lib64/libuuid.so.1 (0x00007f4bbb512000)
	libselinux.so.1 => /lib64/libselinux.so.1 (0x00007f4bbb4e3000)
	libpcre2-8.so.0 => /lib64/libpcre2-8.so.0 (0x00007f4bbb45e000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007f4bbb458000)
build/libnss_mymachines.so.2:
	linux-vdso.so.1 (0x00007ffc19cc0000)
	librt.so.1 => /lib64/librt.so.1 (0x00007fdecb74b000)
	libcap.so.2 => /lib64/libcap.so.2 (0x00007fdecb744000)
	libmount.so.1 => /lib64/libmount.so.1 (0x00007fdecb6e7000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fdecb6c6000)
	libc.so.6 => /lib64/libc.so.6 (0x00007fdecb500000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fdecb8a9000)
	libblkid.so.1 => /lib64/libblkid.so.1 (0x00007fdecb4ad000)
	libuuid.so.1 => /lib64/libuuid.so.1 (0x00007fdecb4a2000)
	libselinux.so.1 => /lib64/libselinux.so.1 (0x00007fdecb475000)
	libpcre2-8.so.0 => /lib64/libpcre2-8.so.0 (0x00007fdecb3f0000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007fdecb3ea000)
build/libnss_resolve.so.2:
	linux-vdso.so.1 (0x00007ffe8ef8e000)
	librt.so.1 => /lib64/librt.so.1 (0x00007fcf314bd000)
	libcap.so.2 => /lib64/libcap.so.2 (0x00007fcf314b6000)
	libmount.so.1 => /lib64/libmount.so.1 (0x00007fcf31459000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fcf31438000)
	libc.so.6 => /lib64/libc.so.6 (0x00007fcf31272000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fcf31615000)
	libblkid.so.1 => /lib64/libblkid.so.1 (0x00007fcf3121f000)
	libuuid.so.1 => /lib64/libuuid.so.1 (0x00007fcf31214000)
	libselinux.so.1 => /lib64/libselinux.so.1 (0x00007fcf311e7000)
	libpcre2-8.so.0 => /lib64/libpcre2-8.so.0 (0x00007fcf31162000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007fcf3115c000)
build/libnss_systemd.so.2:
	linux-vdso.so.1 (0x00007ffda6d17000)
	librt.so.1 => /lib64/librt.so.1 (0x00007f610b83c000)
	libcap.so.2 => /lib64/libcap.so.2 (0x00007f610b835000)
	libmount.so.1 => /lib64/libmount.so.1 (0x00007f610b7d8000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f610b7b7000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f610b5f1000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f610b995000)
	libblkid.so.1 => /lib64/libblkid.so.1 (0x00007f610b59e000)
	libuuid.so.1 => /lib64/libuuid.so.1 (0x00007f610b593000)
	libselinux.so.1 => /lib64/libselinux.so.1 (0x00007f610b566000)
	libpcre2-8.so.0 => /lib64/libpcre2-8.so.0 (0x00007f610b4e1000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f610b4db000)

(after)
build/libnss_myhostname.so.2:
	linux-vdso.so.1 (0x00007fff0b5e2000)
	librt.so.1 => /lib64/librt.so.1 (0x00007fde0c328000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fde0c307000)
	libc.so.6 => /lib64/libc.so.6 (0x00007fde0c141000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fde0c435000)
build/libnss_mymachines.so.2:
	linux-vdso.so.1 (0x00007ffdc30a7000)
	librt.so.1 => /lib64/librt.so.1 (0x00007f06ecabb000)
	libcap.so.2 => /lib64/libcap.so.2 (0x00007f06ecab4000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f06eca93000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f06ec8cd000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f06ecc15000)
build/libnss_resolve.so.2:
	linux-vdso.so.1 (0x00007ffe95747000)
	librt.so.1 => /lib64/librt.so.1 (0x00007fa56a80f000)
	libcap.so.2 => /lib64/libcap.so.2 (0x00007fa56a808000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fa56a7e7000)
	libc.so.6 => /lib64/libc.so.6 (0x00007fa56a621000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fa56a964000)
build/libnss_systemd.so.2:
	linux-vdso.so.1 (0x00007ffe67b51000)
	librt.so.1 => /lib64/librt.so.1 (0x00007ffb32113000)
	libcap.so.2 => /lib64/libcap.so.2 (0x00007ffb3210c000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007ffb320eb000)
	libc.so.6 => /lib64/libc.so.6 (0x00007ffb31f25000)
	/lib64/ld-linux-x86-64.so.2 (0x00007ffb3226a000)

I don't quite understand what is going on here, but let's not be too picky.
2018-11-29 21:03:44 +01:00
Zbigniew Jędrzejewski-Szmek 8b4e51a60e
Merge pull request #10797 from poettering/run-generator
add new "systemd-run-generator" for running arbitrary commands from the kernel command line as system services using the "systemd.run=" kernel command line switch
2018-11-28 22:40:55 +01:00
Yu Watanabe acf4d15893 util: make *_from_name() returns negative errno on error 2018-11-28 20:20:50 +09:00
Lennart Poettering 7af67e9a8b core: allow to set exit status when using SuccessAction=/FailureAction=exit in units
This adds SuccessActionExitStatus= and FailureActionExitStatus= that may
be used to configure the exit status to propagate in when
SuccessAction=exit or FailureAction=exit is used.

When not specified let's also propagate the exit status of the main
process we fork off for the unit.
2018-11-27 09:44:40 +01:00
Lennart Poettering 49fe5c0996 tree-wide: port various places over to STARTSWITH_SET() 2018-11-26 14:08:46 +01:00
Lennart Poettering a9353a5c5b core: log about /var/run/ prefix used in PIDFile=, patch it to be /run instead
In a way this is a follow-up for
a2d1fb882c, but adds a similar warning for
PIDFile=.

There's a much stronger case for doing this kind of notification in
tmpfiles.d (since it helps relating lines to each other for the purpose
of merging them). Doing this for PIDFile= is mostly about being
systematic and copying tmpfiles.d/ behaviour here.

While we are at it, let's also support relative filenames in PIDFile=
now, and prefix them with /run, to make them absolute.

Fixes: #10657
2018-11-10 19:17:00 +01:00
Zbigniew Jędrzejewski-Szmek 469f76f170 core: accept system mode emergency action specifiers with a warning
Before we would only accept those "system" values, so there wasn't other
chocie. Let's provide backwards compatiblity in case somebody made use of
this functionality in user mode.

v2: use 'exit-force' not 'exit'
v3: use error value in log_syntax
2018-10-17 19:31:50 +02:00
Zbigniew Jędrzejewski-Szmek 54fcb6192c core: define "exit" and "exit-force" actions for user units and only accept that
We would accept e.g. FailureAction=reboot-force in user units and then do an
exit in the user manager. Let's be stricter, and define "exit"/"exit-force" as
the only supported actions in user units.

v2:
- rename 'exit' to 'exit-force' and add new 'exit'
- add test for the parsing function
2018-10-17 19:31:49 +02:00
Yu Watanabe 958b8c7bd7 core: fix member access within null pointer
config_parse_tasks_max() is also used for parsing system.conf or
user.conf. In that case, userdata is NULL.

Fixes #10362.
2018-10-11 22:23:39 +02:00
Zbigniew Jędrzejewski-Szmek 5a72417084 pid1: drop unused path parameter to add_two_dependencies_by_name() 2018-09-15 20:02:00 +02:00
Zbigniew Jędrzejewski-Szmek 35d8c19ace pid1: drop now-unused path parameter to add_dependency_by_name() 2018-09-15 19:57:52 +02:00
Tejun Heo 6ae4283cb1 core: add IODeviceLatencyTargetSec
This adds support for the following proposed latency based IO control
mechanism.

  https://lkml.org/lkml/2018/6/5/428
2018-08-22 16:46:18 +02:00
Yu Watanabe fd870bac25 core: introduce cgroup_add_device_allow() 2018-08-06 13:42:14 +09:00
Lennart Poettering f806dfd345 tree-wide: increase granularity of percent specifications all over the place to permille
We so far had various placed we'd parse percentages with
parse_percent(). Let's make them use parse_permille() instead, which is
downward compatible (as it also parses percent values), and increases
the granularity a bit. Given that on the wire we usually normalize
relative specifications to something like UINT32_MAX anyway changing
from base-100 to base-1000 calculations can be done easily without
breaking compat.

This commit doesn't document this change in the man pages. While
allowing more precise specifcations permille is not as commonly
understood as perent I guess, hence let's keep this out of the docs for
now.
2018-07-25 16:14:45 +02:00
Zsolt Dollenstein 566b7d23eb Add support for opening files for appending
Addresses part of #8983
2018-07-20 03:54:22 -07:00
Tejun Heo 4842263577 core: add MemoryMin
The kernel added support for a new cgroup memory controller knob memory.min in
bf8d5d52ffe8 ("memcg: introduce memory.min") which was merged during v4.18
merge window.

Add MemoryMin to support memory.min.
2018-07-12 08:21:43 +02:00
Yu Watanabe 6302d1ea07 core: drop unused log message
temporary_filesystem_add() does not parse mount options.
2018-06-24 19:06:24 +02:00
YmrDtnJu a26fec2408 core: Actually use the resolved path for TemporaryFileSystem= (#9385)
The code already resolves specifiers using unit_full_printf() but then uses the
unresolved version again for temporary_filesystem_add().
2018-06-23 08:17:07 +09:00
Yu Watanabe 1e8c7bd55c namespace: drop protect_{home,system}_or_bool_from_string()
The functions protect_{home,system}_from_string() are not used
except for defining protect_{home,system}_or_bool_from_string().
This makes protect_{home,system}_from_string() support boolean
strings, and drops protect_{home,system}_or_bool_from_string().
2018-06-15 11:32:27 +02:00
Lennart Poettering 96b2fb93c5 tree-wide: beautify remaining copyright statements
Let's unify an beautify our remaining copyright statements, with a
unicode ©. This means our copyright statements are now always formatted
the same way. Yay.
2018-06-14 10:20:21 +02:00
Lennart Poettering 0c69794138 tree-wide: remove Lennart's copyright lines
These lines are generally out-of-date, incomplete and unnecessary. With
SPDX and git repository much more accurate and fine grained information
about licensing and authorship is available, hence let's drop the
per-file copyright notice. Of course, removing copyright lines of others
is problematic, hence this commit only removes my own lines and leaves
all others untouched. It might be nicer if sooner or later those could
go away too, making git the only and accurate source of authorship
information.
2018-06-14 10:20:20 +02:00
Lennart Poettering 818bf54632 tree-wide: drop 'This file is part of systemd' blurb
This part of the copyright blurb stems from the GPL use recommendations:

https://www.gnu.org/licenses/gpl-howto.en.html

The concept appears to originate in times where version control was per
file, instead of per tree, and was a way to glue the files together.
Ultimately, we nowadays don't live in that world anymore, and this
information is entirely useless anyway, as people are very welcome to
copy these files into any projects they like, and they shouldn't have to
change bits that are part of our copyright header for that.

hence, let's just get rid of this old cruft, and shorten our codebase a
bit.
2018-06-14 10:20:20 +02:00
Lennart Poettering 57e84e7535 core: rework how we validate DeviceAllow= settings
Let's make sure we don't validate "char-*" and "block-*" expressions as
paths.
2018-06-11 18:01:06 +02:00
Lennart Poettering 9d5e9b4add cgroup: relax checks for block device cgroup settings
This drops needless safety checks that ensure we only reference block
devices for blockio/io settings. The backing code was already able to
accept regular file system paths too, in which case the backing device
node of that file system would be used. Hence, let's drop the artificial
restrictions and open up this underlying functionality.
2018-06-11 18:01:06 +02:00
Lennart Poettering 6f40aa4547 core: add a couple of more error cases that should result in "bad-setting"
This changes a number of EINVAL cases to ENOEXEC, so that we enter
"bad-setting" state if they fail.
2018-06-11 12:53:12 +02:00
Yu Watanabe f106314c89 conf-parser: remove redundant utf8-validity check 2018-06-04 01:38:54 +09:00
Yu Watanabe 2f4d31c117 load-fragment: use path_simplify_and_warn() where applicable 2018-06-03 23:59:42 +09:00
Yu Watanabe 858d36c1ec path-util: introduce path_simplify()
The function is similar to path_kill_slashes() but also removes
initial './', trailing '/.', and '/./' in the path.
When the second argument of path_simplify() is false, then it
behaves as the same as path_kill_slashes(). Hence, this also
replaces path_kill_slashes() with path_simplify().
2018-06-03 23:39:26 +09:00