Commit graph

250 commits

Author SHA1 Message Date
Zbigniew Jędrzejewski-Szmek a232ebcc2c core: add support for RestartKillSignal= to override signal used for restart jobs
v2:
- if RestartKillSignal= is not specified, fall back to KillSignal=. This is necessary
  to preserve backwards compatibility (and keep KillSignal= generally useful).
2019-10-02 14:01:25 +02:00
Pavel Hrdina 047f5d63d7 cgroup: introduce support for cgroup v2 CPUSET controller
Introduce support for configuring cpus and mems for processes using
cgroup v2 CPUSET controller.  This allows users to limit which cpus
and memory NUMA nodes can be used by processes to better utilize
system resources.

The cgroup v2 interfaces to control it are cpuset.cpus and cpuset.mems
where the requested configuration is written.  However, it doesn't mean
that the requested configuration will be actually used as parent cgroup
may limit the cpus or mems as well.  In order to reflect the real
configuration cgroup v2 provides read-only files cpuset.cpus.effective
and cpuset.mems.effective which are exported to users as well.
2019-09-24 15:16:07 +02:00
Zbigniew Jędrzejewski-Szmek 5ac1530eca tree-wide: say "ratelimit" not "rate_limit"
"ratelimit" is a real word, so we don't need to use the other form anywhere.
We had both forms in various places, let's standarize on the shorter and more
correct one.
2019-09-20 16:05:53 +02:00
Zbigniew Jędrzejewski-Szmek 7bf081a1e5 pid1: rename start_limit to start_ratelimit
This way it is clearer what the type is. We also have auto_stop_ratelimit adjacent,
and it feels ugly to have a different suffix for those two.
2019-09-20 16:05:53 +02:00
Zbigniew Jędrzejewski-Szmek 6b4f7fb08c
Merge pull request #13385 from yuwata/core-remove-private-directories-13355
core: also remove private directories by systemctl clean
2019-08-31 09:28:39 +02:00
Yu Watanabe 12213aed12 core: move timeout_clean_usec from Service to ExecContext 2019-08-28 23:09:54 +09:00
Zbigniew Jędrzejewski-Szmek ae480f0b09 shared/user-util: allow usernames with dots in specific fields
People do have usernames with dots, and it makes them very unhappy that systemd
doesn't like their that. It seems that there is no actual problem with allowing
dots in the username. In particular chown declares ":" as the official
separator, and internally in systemd we never rely on "." as the seperator
between user and group (nor do we call chown directly). Using dots in the name
is probably not a very good idea, but we don't need to care. Debian tools
(adduser) do not allow users with dots to be created.

This patch allows *existing* names with dots to be used in User, Group,
SupplementaryGroups, SocketUser, SocketGroup fields, both in unit files and on
the command line. DynamicUsers and sysusers still follow the strict policy.
user@.service and tmpfiles already allowed arbitrary user names, and this
remains unchanged.

Fixes #12754.
2019-08-19 21:19:13 +02:00
Anita Zhang 31cd5f63ce core: ExecCondition= for services
Closes #10596
2019-07-17 11:35:02 +02:00
Lennart Poettering 4c2f584230 core: hook up service unit type with the new clean operation
The implementation is pretty straight-foward: when we get a request to
clean some type of resources we fork off a process doing that, and while
it is running we are in the "cleaning" state.
2019-07-11 12:18:51 +02:00
Zbigniew Jędrzejewski-Szmek edfea9fe0d analyze: add 'condition' verb
We didn't have a straightforward way to parse and evaluate those strings.
Prompted by #12881.
2019-06-27 10:54:37 +02:00
Kai Lüke fab347489f bpf-firewall: custom BPF programs through IP(Ingress|Egress)FilterPath=
Takes a single /sys/fs/bpf/pinned_prog string as argument, but may be
specified multiple times. An empty assignment resets all previous filters.

Closes https://github.com/systemd/systemd/issues/10227
2019-06-25 09:56:16 +02:00
Michal Sekletar b070c7c0e1 core: introduce NUMAPolicy and NUMAMask options
Make possible to set NUMA allocation policy for manager. Manager's
policy is by default inherited to all forked off processes. However, it
is possible to override the policy on per-service basis. Currently we
support, these policies: default, prefer, bind, interleave, local.
See man 2 set_mempolicy for details on each policy.

Overall NUMA policy actually consists of two parts. Policy itself and
bitmask representing NUMA nodes where is policy effective. Node mask can
be specified using related option, NUMAMask. Default mask can be
overwritten on per-service level.
2019-06-24 16:58:54 +02:00
Chris Down 7e7223b3d5 cgroup: Readd some plumbing for DefaultMemoryMin
Somehow these got lost in the previous PR, rendering DefaultMemoryMin
not very useful.
2019-05-08 12:06:32 +01:00
Jan Klötzke dc653bf487 service: handle abort stops with dedicated timeout
When shooting down a service with SIGABRT the user might want to have a
much longer stop timeout than on regular stops/shutdowns. Especially in
the face of short stop timeouts the time might not be sufficient to
write huge core dumps before the service is killed.

This commit adds a dedicated (Default)TimeoutAbortSec= timer that is
used when stopping a service via SIGABRT. In all other cases the
existing TimeoutStopSec= is used. The timer value is unset by default
to skip the special handling and use TimeoutStopSec= for state
'stop-watchdog' to keep the old behaviour.

If the service is in state 'stop-watchdog' and the service should be
stopped explicitly we still go to 'stop-sigterm' and re-apply the usual
TimeoutStopSec= timeout.
2019-04-12 17:32:52 +02:00
Chris Down c52db42b78 cgroup: Implement default propagation of MemoryLow with DefaultMemoryLow
In cgroup v2 we have protection tunables -- currently MemoryLow and
MemoryMin (there will be more in future for other resources, too). The
design of these protection tunables requires not only intermediate
cgroups to propagate protections, but also the units at the leaf of that
resource's operation to accept it (by setting MemoryLow or MemoryMin).

This makes sense from an low-level API design perspective, but it's a
good idea to also have a higher-level abstraction that can, by default,
propagate these resources to children recursively. In this patch, this
happens by having descendants set memory.low to N if their ancestor has
DefaultMemoryLow=N -- assuming they don't set a separate MemoryLow
value.

Any affected unit can opt out of this propagation by manually setting
`MemoryLow` to some value in its unit configuration. A unit can also
stop further propagation by setting `DefaultMemoryLow=` with no
argument. This removes further propagation in the subtree, but has no
effect on the unit itself (for that, use `MemoryLow=0`).

Our use case in production is simplifying the configuration of machines
which heavily rely on memory protection tunables, but currently require
tweaking a huge number of unit files to make that a reality. This
directive makes that significantly less fragile, and decreases the risk
of misconfiguration.

After this patch is merged, I will implement DefaultMemoryMin= using the
same principles.
2019-04-12 17:23:58 +02:00
Lennart Poettering afcfaa695c core: implement OOMPolicy= and watch cgroups for OOM killings
This adds a new per-service OOMPolicy= (along with a global
DefaultOOMPolicy=) that controls what to do if a process of the service
is killed by the kernel's OOM killer. It has three different values:
"continue" (old behaviour), "stop" (terminate the service), "kill" (let
the kernel kill all the service's processes).

On top of that, track OOM killer events per unit: generate a per-unit
structured, recognizable log message when we see an OOM killer event,
and put the service in a failure state if an OOM killer event was seen
and the selected policy was not "continue". A new "result" is defined
for this case: "oom-kill".

All of this relies on new cgroupv2 kernel functionality: the
"memory.events" notification interface and the "memory.oom.group"
attribute (which makes the kernel kill all cgroup processes
automatically).
2019-04-09 11:17:58 +02:00
Davide Cavalca 639dd43a36 core: fix build failure if seccomp is disabled 2019-04-03 13:46:32 +09:00
Lennart Poettering f69567cbe2 core: expose SUID/SGID restriction as new unit setting RestrictSUIDSGID= 2019-04-02 16:56:48 +02:00
Lennart Poettering efebb613c7 core: optionally, trigger .timer units on timezone and clock changes
Fixes: #6228
2019-04-02 08:20:10 +02:00
Lennart Poettering 25a04ae55e core: simply timer expression parsing by using ".ltype" field of conf-parser logic
No change of behaviour. Let's just not parse the lvalue all the time
with timer_base_from_string() if we can already pass it in parsed.
2019-04-01 18:25:43 +02:00
Lennart Poettering a8d08f39d1 core: add new setting NetworkNamespacePath= for configuring a netns by path for a service
Fixes: #2741
2019-03-07 16:55:23 +01:00
Lennart Poettering eb5149ba74
Merge pull request #11682 from topimiettinen/private-utsname
core: ProtectHostname feature
2019-02-20 14:12:15 +01:00
Topi Miettinen aecd5ac621 core: ProtectHostname= feature
Let services use a private UTS namespace. In addition, a seccomp filter is
installed on set{host,domain}name and a ro bind mounts on
/proc/sys/kernel/{host,domain}name.
2019-02-20 10:50:44 +02:00
Filipe Brandenburger 10f2864111 core: add CPUQuotaPeriodSec=
This new setting allows configuration of CFS period on the CPU cgroup, instead
of using a hardcoded default of 100ms.

Tested:
- Legacy cgroup + Unified cgroup
- systemctl set-property
- systemctl show
- Confirmed that the cgroup settings (such as cpu.cfs_period_ns) were set
  appropriately, including updating the CPU quota (cpu.cfs_quota_ns) when
  CPUQuotaPeriodSec= is updated.
- Checked that clamping works properly when either period or (quota * period)
  are below the resolution of 1ms, or if period is above the max of 1s.
2019-02-14 11:04:42 -08:00
Chris Down c72703e26d cgroup: Add DisableControllers= directive to disable controller in subtree
Some controllers (like the CPU controller) have a performance cost that
is non-trivial on certain workloads. While this can be mitigated and
improved to an extent, there will for some controllers always be some
overheads associated with the benefits gained from the controller.
Inside Facebook, the fix applied has been to disable the CPU controller
forcibly with `cgroup_disable=cpu` on the kernel command line.

This presents a problem: to disable or reenable the controller, a reboot
is required, but this is quite cumbersome and slow to do for many
thousands of machines, especially machines where disabling/enabling a
stateful service on a machine is a matter of several minutes.

Currently systemd provides some configuration knobs for these in the
form of `[Default]CPUAccounting`, `[Default]MemoryAccounting`, and the
like. The limitation of these is that Default*Accounting is overrideable
by individual services, of which any one could decide to reenable a
controller within the hierarchy at any point just by using a controller
feature implicitly (eg. `CPUWeight`), even if the use of that CPU
feature could just be opportunistic. Since many services are provided by
the distribution, or by upstream teams at a particular organisation,
it's not a sustainable solution to simply try to find and remove
offending directives from these units.

This commit presents a more direct solution -- a DisableControllers=
directive that forcibly disallows a controller from being enabled within
a subtree.
2018-12-03 15:40:31 +00:00
Lennart Poettering 7af67e9a8b core: allow to set exit status when using SuccessAction=/FailureAction=exit in units
This adds SuccessActionExitStatus= and FailureActionExitStatus= that may
be used to configure the exit status to propagate in when
SuccessAction=exit or FailureAction=exit is used.

When not specified let's also propagate the exit status of the main
process we fork off for the unit.
2018-11-27 09:44:40 +01:00
Lennart Poettering a9353a5c5b core: log about /var/run/ prefix used in PIDFile=, patch it to be /run instead
In a way this is a follow-up for
a2d1fb882c, but adds a similar warning for
PIDFile=.

There's a much stronger case for doing this kind of notification in
tmpfiles.d (since it helps relating lines to each other for the purpose
of merging them). Doing this for PIDFile= is mostly about being
systematic and copying tmpfiles.d/ behaviour here.

While we are at it, let's also support relative filenames in PIDFile=
now, and prefix them with /run, to make them absolute.

Fixes: #10657
2018-11-10 19:17:00 +01:00
Anita Zhang 90fc172e19 core: implement per unit journal rate limiting
Add LogRateLimitIntervalSec= and LogRateLimitBurst= options for
services. If provided, these values get passed to the journald
client context, and those values are used in the rate limiting
function in the journal over the the journald.conf values.

Part of #10230
2018-10-18 09:56:20 +02:00
Anita Zhang c87700a133 Make Watchdog Signal Configurable
Allows configuring the watchdog signal (with a default of SIGABRT).
This allows an alternative to SIGABRT when coredumps are not desirable.

Appropriate references to SIGABRT or aborting were renamed to reflect
more liberal watchdog signals.

Closes #8658
2018-09-26 16:14:29 +02:00
Tejun Heo 6ae4283cb1 core: add IODeviceLatencyTargetSec
This adds support for the following proposed latency based IO control
mechanism.

  https://lkml.org/lkml/2018/6/5/428
2018-08-22 16:46:18 +02:00
Jon Ringle fbb48d4c66 Make final kill signal configurable
Usecase is to allow changing the final kill from SIGKILL to SIGQUIT which
should create a core dump useful for debugging why the service didn't stop
with the SIGTERM
2018-07-23 13:44:54 +02:00
Tejun Heo 4842263577 core: add MemoryMin
The kernel added support for a new cgroup memory controller knob memory.min in
bf8d5d52ffe8 ("memcg: introduce memory.min") which was merged during v4.18
merge window.

Add MemoryMin to support memory.min.
2018-07-12 08:21:43 +02:00
Lennart Poettering 228af36fff core: add new PrivateMounts= unit setting
This new setting is supposed to be useful in most cases where
"MountFlags=slave" is currently used, i.e. as an explicit way to run a
service in its own mount namespace and decouple propagation from all
mounts of the new mount namespace towards the host.

The effect of MountFlags=slave and PrivateMounts=yes is mostly the same,
as both cause a CLONE_NEWNS namespace to be opened, and both will result
in all mounts within it to be mounted MS_SLAVE. The difference is mostly
on the conceptual/philosophical level: configuring the propagation mode
is nothing people should have to think about, in particular as the
matter is not precisely easyto grok. Moreover, MountFlags= allows configuration
of "private" and "slave" modes which don't really make much sense to use
in real-life and are quite confusing. In particular PrivateMounts=private means
mounts made on the host stay pinned for good by the service which is
particularly nasty for removable media mount. And PrivateMounts=shared
is in most ways a NOP when used a alone...

The main technical difference between setting only MountFlags=slave or
only PrivateMounts=yes in a unit file is that the former remounts all
mounts to MS_SLAVE and leaves them there, while that latter remounts
them to MS_SHARED again right after. The latter is generally a nicer
approach, since it disables propagation, while MS_SHARED is afterwards
in effect, which is really nice as that means further namespacing down
the tree will get MS_SHARED logic by default and we unify how
applications see our mounts as we always pass them as MS_SHARED
regardless whether any mount namespacing is used or not.

The effect of PrivateMounts=yes was implied already by all the other
mount namespacing options. With this new option we add an explicit knob
for it, to request it without any other option used as well.

See: #4393
2018-06-12 16:12:10 +02:00
Yu Watanabe 984faf29da load-fragment: use DEFINE_CONFIG_PARSE_*() macros 2018-05-31 11:09:41 +09:00
Yu Watanabe 00463fbf0d load-fragment: make SocketProtocol= accept the empty string 2018-05-31 11:09:41 +09:00
Yu Watanabe fa65c28176 namespace: rename parse_protect_{home,system}_or_bool() to protect_{home,system}_or_bool_to_string()
Hence, we can define config_parse_protect_{home,system}() by using
DEFINE_CONFIG_PARSE_ENUM() macro.
2018-05-31 11:09:41 +09:00
Yu Watanabe b54e98ef8e socket-util: rename parse_socket_address_bind_ipv6_only_or_bool() to socket_address_bind_ipv6_only_or_bool_from_string()
Hence, we can define config_parse_socket_bind() by using
DEFINE_CONFIG_PARSE_ENUM() macro.
2018-05-31 11:09:41 +09:00
Yu Watanabe 0a9e363870 load-fragment: drop config_parse_no_new_privileges() and use config_parse_bool() instead 2018-05-31 11:09:41 +09:00
Yu Watanabe 6c58305ac3 load-fragment: use config_parse_sec_fix_0() for TimeoutStopSec= 2018-05-31 11:09:41 +09:00
Lennart Poettering 4f424df760 core: move config_parse_limit() to the generic conf-parser.[ch]
That way we can use it in nspawn.

Also, while we are at it, let's rename the call config_parse_rlimit(),
i.e. insert the "r", to clarify what kind of limit this is about.
2018-05-17 20:36:52 +02:00
Felipe Sateler 57b7a260c2 core: undo the dependency inversion between unit.h and all unit types 2018-05-15 14:24:34 -04:00
Yu Watanabe 2abd4e388a core: add new setting TemporaryFileSystem=
This introduces a new setting TemporaryFileSystem=. This is useful
to hide files not relevant to the processes invoked by unit, while
necessary files or directories can be still accessed by combining
with Bind{,ReadOnly}Paths=.
2018-02-21 09:17:52 +09:00
Yu Watanabe 8ab3934766 load-fragment: obsolete OnFailureIsolate= 2018-01-02 02:23:17 +09:00
Lennart Poettering 5022f08a23 core,udev,networkd: add ConditionKernelVersion=
This adds a simple condition/assert/match to the service manager, to
udev's .link handling and to networkd, for matching the kernel version
string.

In this version we only do fnmatch() based globbing, but we might want
to extend that to version comparisons later on, if we like, by slightly
extending the syntax with ">=", "<=", ">", "<" and "==" expressions.
2017-12-26 17:39:44 +01:00
Chris Down e16647c39d condition: Create AssertControlGroupController (#7630)
Up until now, the behaviour in systemd has (mostly) been to silently
ignore failures to action unit directives that refer to an unavailble
controller. The addition of AssertControlGroupController and its
conditional counterpart allow explicit specification of the desired
behaviour when such a situation occurs.

As for how this can happen, it is possible that a particular controller
is not available in the cgroup hierarchy. One possible reason for this
is that, in the running kernel, the controller simply doesn't exist --
for example, the CPU controller in cgroup v2 has only recently been
merged and was out of tree until then. Another possibility is that the
controller exists, but has been forcibly disabled by `cgroup_disable=`
on the kernel command line.

In future this will also support whatever comes out of issue #7624,
`DefaultXAccounting=never`, or similar.
2017-12-18 08:53:29 +01:00
Lennart Poettering b238be1e0d core: enable specifier expansion for What=/Where=/Type=/SourcePath= too
Using specifiers in these settings isn't particularly useful by itself,
but it unifies behaviour a bit. It's kinda surprising that What= in
mount units resolves specifies, but Where= does not. Hence let's add
that too. Also, it's surprising Where=/What= in mount units behaves
differently than in automount and swap units, hence resolve specifiers
there too. Then, Type= in mount units is nowadays an arbitrary,
sometimes non-trivial string (think fuse!), hence let's also expand
specifiers there, to match the rest of the mount settings.

This has the benefit that when writing code that generates unit files,
less care has to be taken to check whether escaping of specifiers is
necessary or not: broadly everything that takes arbitrary user strings
now does specifier expansion, while enums/numerics/booleans do not.
2017-11-29 12:32:57 +01:00
Lennart Poettering 613613f1ee core: use config_parse_unit_string_printf() for decoding RebootArgument=
All other cases where we accept a reboot argument are decoded with
config_parse_unit_string_printf() rather than
config_parse_unit_path_printf(), and that's really the only thing what
makes sense here, hence adjust this here, too.
2017-11-29 12:32:56 +01:00
Zbigniew Jędrzejewski-Szmek 82a27ba821
Merge pull request #7389 from shawnl/warning
tree-wide: adjust fall through comments so that gcc is happy
2017-11-22 07:38:51 +01:00
Shawn Landden 4831981d89 tree-wide: adjust fall through comments so that gcc is happy
Distcc removes comments, making the comment silencing
not work.

I know there was a decision against a macro in commit
ec251fe7d5
2017-11-20 13:06:25 -08:00
Lennart Poettering e7dfbb4e74 core: introduce SuccessAction= as unit file property
SuccessAction= is similar to FailureAction= but declares what to do on
success of a unit, rather than on failure. This is useful for running
commands in qemu/nspawn images, that shall power down on completion. We
frequently see "ExecStopPost=/usr/bin/systemctl poweroff" or so in unit
files like this. Offer a simple, more declarative alternative for this.

While we are at it, hook up failure action with unit_dump() and
transient units too.
2017-11-20 16:37:22 +01:00