Commit Graph

279 Commits

Author SHA1 Message Date
Kristijan Gjoshev acf24a1a84 timer: add new feature FixedRandomDelay=
FixedRandomDelay=yes will use
`siphash24(sd_id128_get_machine() || MANAGER_IS_SYSTEM(m) || getuid() || u->id)`,
where || is concatenation, instead of a random number to choose a value between
0 and RandomizedDelaySec= as the timer delay.
This essentially sets up a fixed, but seemingly random, offset for each timer
iteration rather than having a random offset recalculated each time it fires.

Closes #10355

Co-author: Anita Zhang <the.anitazha@gmail.com>
2020-11-05 10:59:33 -08:00
Lennart Poettering 9b1915256c core: add Timestamping= option for socket units
This adds a way to control SO_TIMESTAMP/SO_TIMESTAMPNS socket options
for sockets PID 1 binds to.

This is useful in journald so that we get proper timestamps even for
ingress log messages that are submitted before journald is running.

We recently turned on packet info metadata from PID 1 for these sockets,
but the timestamping info was still missing. Let's correct that.
2020-10-27 14:12:39 +01:00
Anita Zhang f561e8c659 core: move where we send unit change updates to oomd
Post-merge suggestion from #15206
2020-10-19 02:46:07 -07:00
Anita Zhang 620ed14e44 core: reindent and align table in load-fragment-gperf.gperf.m4 2020-10-19 02:46:07 -07:00
Anita Zhang e30bbc90c9 core: add varlink call to get cgroup paths of units using ManagedOOM*= 2020-10-07 16:17:23 -07:00
Anita Zhang 4d824a4e0b core: add ManagedOOM*= properties to configure systemd-oomd on the unit
This adds the hook ups so it can be read with the usual systemd
utilities. Used in later commits by sytemd-oomd.
2020-10-07 16:17:23 -07:00
Topi Miettinen 9df2cdd8ec exec: SystemCallLog= directive
With new directive SystemCallLog= it's possible to list system calls to be
logged. This can be used for auditing or temporarily when constructing system
call filters.

---
v5: drop intermediary, update HASHMAP_FOREACH_KEY() use
v4: skip useless debug messages, actually parse directive
v3: don't declare unused variables with old libseccomp
v2: fix build without seccomp or old libseccomp
2020-09-15 12:54:17 +03:00
Renaud Métrich 3e5f04bf64 socket: New option 'FlushPending' (boolean) to flush socket before entering listening state
Disabled by default. When Enabled, before listening on the socket, flush the content.
Applies when Accept=no only.
2020-09-01 17:20:23 +02:00
Lennart Poettering bb0c0d6f29 core: add credentials logic
Fixes: #15778 #16060
2020-08-25 19:45:35 +02:00
Lennart Poettering f053c9477b core: drop redundant comment
Since 625a164069 we don't need to update
analyze-condition.c separately anymore, hence drop the comment
suggesting otherwise.
2020-08-25 07:47:50 +02:00
Lennart Poettering 4e39995371 core: introduce ProtectProc= and ProcSubset= to expose hidepid= and subset= procfs mount options
Kernel 5.8 gained a hidepid= implementation that is truly per procfs,
which allows us to mount a distinct once into every unit, with
individual hidepid= settings. Let's expose this via two new settings:
ProtectProc= (wrapping hidpid=) and ProcSubset= (wrapping subset=).

Replaces: #11670
2020-08-24 20:11:02 +02:00
Lennart Poettering 476cfe626d core: remove support for ConditionNull=
The concept is flawed, and mostly useless. Let's finally remove it.

It has been deprecated since 90a2ec10f2 (6
years ago) and we started to warn since
55dadc5c57 (1.5 years ago).

Let's get rid of it altogether.
2020-08-20 14:01:25 +02:00
Lennart Poettering 4f55a5b0bf core: add missing conditions/asserts to unit file parsing 2020-08-20 13:56:14 +02:00
Luca Boccassi b3d133148e core: new feature MountImages
Follows the same pattern and features as RootImage, but allows an
arbitrary mount point under / to be specified by the user, and
multiple values - like BindPaths.

Original implementation by @topimiettinen at:
https://github.com/systemd/systemd/pull/14451
Reworked to use dissect's logic instead of bare libmount() calls
and other review comments.
Thanks Topi for the initial work to come up with and implement
this useful feature.
2020-08-05 21:34:55 +01:00
Luca Boccassi 18d7370587 service: add new RootImageOptions feature
Allows to specify mount options for RootImage.
In case of multi-partition images, the partition number can be prefixed
followed by colon. Eg:

RootImageOptions=1:ro,dev 2:nosuid nodev

In absence of a partition number, 0 is assumed.
2020-07-29 17:17:32 +01:00
Luca Boccassi d4d55b0d13 core: add RootHashSignature service parameter
Allow to explicitly pass root hash signature as a unit option. Takes precedence
over implicit checks.
2020-06-25 08:45:21 +01:00
Luca Boccassi 0389f4fa81 core: add RootHash and RootVerity service parameters
Allow to explicitly pass root hash (explicitly or as a file) and verity
device/file as unit options. Take precedence over implicit checks.
2020-06-23 10:50:09 +02:00
Jan Klötzke bf76080180 core: let user define start-/stop-timeout behaviour
The usual behaviour when a timeout expires is to terminate/kill the
service. This is what user usually want in production systems. To debug
services that fail to start/stop (especially sporadic failures) it
might be necessary to trigger the watchdog machinery and write core
dumps, though. Likewise, it is usually just a waste of time to
gracefully stop a stuck service. Instead it might save time to go
directly into kill mode.

This commit adds two new options to services: TimeoutStartFailureMode=
and TimeoutStopFailureMode=. Both take the same values and tweak the
behavior of systemd when a start/stop timeout expires:

 * 'terminate': is the default behaviour as it has always been,
 * 'abort': triggers the watchdog machinery and will send SIGABRT
   (unless WatchdogSignal was changed) and
 * 'kill' will directly send SIGKILL.

To handle the stop failure mode in stop-post state too a new
final-watchdog state needs to be introduced.
2020-06-09 10:04:57 +02:00
Lennart Poettering a3d19f5d99 core: add new PassPacketInfo= socket unit property 2020-05-27 22:40:38 +02:00
Martin Hundebøll c600357ba6 mount: add ReadWriteOnly property to fail on read-only mounts
Systems where a mount point is expected to be read-write needs a way to
fail mount units that fallback as read-only.

Add a property to allow setting the -w option when calling mount(8).
2020-05-01 13:23:30 +02:00
Zbigniew Jędrzejewski-Szmek ad21e542b2 manager: add CoredumpFilter= setting
Fixes #6685.
2020-04-09 14:08:48 +02:00
Lennart Poettering 91dd5f7cbe core: add new LogNamespace= execution setting 2020-01-31 15:01:43 +01:00
Kevin Kuehler fc64760dda core: shared: Add ProtectClock= to systemd.exec 2020-01-26 12:23:33 -08:00
Lennart Poettering eb34a981d6 core: initialize priority_set when parsing swap unit files
Fixes: #14524
2020-01-09 17:08:31 +01:00
Zbigniew Jędrzejewski-Szmek 0b8d307587 pid1: fix the names of AllowedCPUs= and AllowedMemoryNodes=
The original PR was submitted with CPUSetCpus and CPUSetMems, which was later
changed to AllowedCPUs and AllowedMemmoryNodes everywhere (including the parser
used by systemd-run), but not in the parser for unit files.

Since we already released -rc1, let's keep support for the old names. I think
we can remove it in a release or two if anyone remembers to do that.

Fixes #14126. Follow-up for 047f5d63d7.
2019-11-25 14:02:14 +01:00
Kevin Kuehler 8470304018 core: Add ProtectKernelLogs
If seccomp is enabled, load the SYSCALL_FILTER_SET_SYSLOG into the
seccomp filter set. Drop the CAP_SYSLOG capability.
2019-11-11 12:12:02 -08:00
Yu Watanabe f5947a5e92 tree-wide: drop missing.h 2019-10-31 17:57:03 +09:00
Zbigniew Jędrzejewski-Szmek a5f6f346d3
Merge pull request #13423 from pwithnall/12035-session-time-limits
Add `RuntimeMaxSec=` support to scope units (time-limited login sessions)
2019-10-28 14:57:00 +01:00
Philip Withnall 9ed7de605d scope: Support RuntimeMaxSec= directive in scope units
Just as `RuntimeMaxSec=` is supported for service units, add support for
it to scope units. This will gracefully kill a scope after the timeout
expires from the moment the scope enters the running state.

This could be used for time-limited login sessions, for example.

Signed-off-by: Philip Withnall <withnall@endlessm.com>

Fixes: #12035
2019-10-28 09:44:31 +01:00
Zbigniew Jędrzejewski-Szmek a232ebcc2c core: add support for RestartKillSignal= to override signal used for restart jobs
v2:
- if RestartKillSignal= is not specified, fall back to KillSignal=. This is necessary
  to preserve backwards compatibility (and keep KillSignal= generally useful).
2019-10-02 14:01:25 +02:00
Pavel Hrdina 047f5d63d7 cgroup: introduce support for cgroup v2 CPUSET controller
Introduce support for configuring cpus and mems for processes using
cgroup v2 CPUSET controller.  This allows users to limit which cpus
and memory NUMA nodes can be used by processes to better utilize
system resources.

The cgroup v2 interfaces to control it are cpuset.cpus and cpuset.mems
where the requested configuration is written.  However, it doesn't mean
that the requested configuration will be actually used as parent cgroup
may limit the cpus or mems as well.  In order to reflect the real
configuration cgroup v2 provides read-only files cpuset.cpus.effective
and cpuset.mems.effective which are exported to users as well.
2019-09-24 15:16:07 +02:00
Zbigniew Jędrzejewski-Szmek 5ac1530eca tree-wide: say "ratelimit" not "rate_limit"
"ratelimit" is a real word, so we don't need to use the other form anywhere.
We had both forms in various places, let's standarize on the shorter and more
correct one.
2019-09-20 16:05:53 +02:00
Zbigniew Jędrzejewski-Szmek 7bf081a1e5 pid1: rename start_limit to start_ratelimit
This way it is clearer what the type is. We also have auto_stop_ratelimit adjacent,
and it feels ugly to have a different suffix for those two.
2019-09-20 16:05:53 +02:00
Zbigniew Jędrzejewski-Szmek 6b4f7fb08c
Merge pull request #13385 from yuwata/core-remove-private-directories-13355
core: also remove private directories by systemctl clean
2019-08-31 09:28:39 +02:00
Yu Watanabe 12213aed12 core: move timeout_clean_usec from Service to ExecContext 2019-08-28 23:09:54 +09:00
Zbigniew Jędrzejewski-Szmek ae480f0b09 shared/user-util: allow usernames with dots in specific fields
People do have usernames with dots, and it makes them very unhappy that systemd
doesn't like their that. It seems that there is no actual problem with allowing
dots in the username. In particular chown declares ":" as the official
separator, and internally in systemd we never rely on "." as the seperator
between user and group (nor do we call chown directly). Using dots in the name
is probably not a very good idea, but we don't need to care. Debian tools
(adduser) do not allow users with dots to be created.

This patch allows *existing* names with dots to be used in User, Group,
SupplementaryGroups, SocketUser, SocketGroup fields, both in unit files and on
the command line. DynamicUsers and sysusers still follow the strict policy.
user@.service and tmpfiles already allowed arbitrary user names, and this
remains unchanged.

Fixes #12754.
2019-08-19 21:19:13 +02:00
Anita Zhang 31cd5f63ce core: ExecCondition= for services
Closes #10596
2019-07-17 11:35:02 +02:00
Lennart Poettering 4c2f584230 core: hook up service unit type with the new clean operation
The implementation is pretty straight-foward: when we get a request to
clean some type of resources we fork off a process doing that, and while
it is running we are in the "cleaning" state.
2019-07-11 12:18:51 +02:00
Zbigniew Jędrzejewski-Szmek edfea9fe0d analyze: add 'condition' verb
We didn't have a straightforward way to parse and evaluate those strings.
Prompted by #12881.
2019-06-27 10:54:37 +02:00
Kai Lüke fab347489f bpf-firewall: custom BPF programs through IP(Ingress|Egress)FilterPath=
Takes a single /sys/fs/bpf/pinned_prog string as argument, but may be
specified multiple times. An empty assignment resets all previous filters.

Closes https://github.com/systemd/systemd/issues/10227
2019-06-25 09:56:16 +02:00
Michal Sekletar b070c7c0e1 core: introduce NUMAPolicy and NUMAMask options
Make possible to set NUMA allocation policy for manager. Manager's
policy is by default inherited to all forked off processes. However, it
is possible to override the policy on per-service basis. Currently we
support, these policies: default, prefer, bind, interleave, local.
See man 2 set_mempolicy for details on each policy.

Overall NUMA policy actually consists of two parts. Policy itself and
bitmask representing NUMA nodes where is policy effective. Node mask can
be specified using related option, NUMAMask. Default mask can be
overwritten on per-service level.
2019-06-24 16:58:54 +02:00
Chris Down 7e7223b3d5 cgroup: Readd some plumbing for DefaultMemoryMin
Somehow these got lost in the previous PR, rendering DefaultMemoryMin
not very useful.
2019-05-08 12:06:32 +01:00
Jan Klötzke dc653bf487 service: handle abort stops with dedicated timeout
When shooting down a service with SIGABRT the user might want to have a
much longer stop timeout than on regular stops/shutdowns. Especially in
the face of short stop timeouts the time might not be sufficient to
write huge core dumps before the service is killed.

This commit adds a dedicated (Default)TimeoutAbortSec= timer that is
used when stopping a service via SIGABRT. In all other cases the
existing TimeoutStopSec= is used. The timer value is unset by default
to skip the special handling and use TimeoutStopSec= for state
'stop-watchdog' to keep the old behaviour.

If the service is in state 'stop-watchdog' and the service should be
stopped explicitly we still go to 'stop-sigterm' and re-apply the usual
TimeoutStopSec= timeout.
2019-04-12 17:32:52 +02:00
Chris Down c52db42b78 cgroup: Implement default propagation of MemoryLow with DefaultMemoryLow
In cgroup v2 we have protection tunables -- currently MemoryLow and
MemoryMin (there will be more in future for other resources, too). The
design of these protection tunables requires not only intermediate
cgroups to propagate protections, but also the units at the leaf of that
resource's operation to accept it (by setting MemoryLow or MemoryMin).

This makes sense from an low-level API design perspective, but it's a
good idea to also have a higher-level abstraction that can, by default,
propagate these resources to children recursively. In this patch, this
happens by having descendants set memory.low to N if their ancestor has
DefaultMemoryLow=N -- assuming they don't set a separate MemoryLow
value.

Any affected unit can opt out of this propagation by manually setting
`MemoryLow` to some value in its unit configuration. A unit can also
stop further propagation by setting `DefaultMemoryLow=` with no
argument. This removes further propagation in the subtree, but has no
effect on the unit itself (for that, use `MemoryLow=0`).

Our use case in production is simplifying the configuration of machines
which heavily rely on memory protection tunables, but currently require
tweaking a huge number of unit files to make that a reality. This
directive makes that significantly less fragile, and decreases the risk
of misconfiguration.

After this patch is merged, I will implement DefaultMemoryMin= using the
same principles.
2019-04-12 17:23:58 +02:00
Lennart Poettering afcfaa695c core: implement OOMPolicy= and watch cgroups for OOM killings
This adds a new per-service OOMPolicy= (along with a global
DefaultOOMPolicy=) that controls what to do if a process of the service
is killed by the kernel's OOM killer. It has three different values:
"continue" (old behaviour), "stop" (terminate the service), "kill" (let
the kernel kill all the service's processes).

On top of that, track OOM killer events per unit: generate a per-unit
structured, recognizable log message when we see an OOM killer event,
and put the service in a failure state if an OOM killer event was seen
and the selected policy was not "continue". A new "result" is defined
for this case: "oom-kill".

All of this relies on new cgroupv2 kernel functionality: the
"memory.events" notification interface and the "memory.oom.group"
attribute (which makes the kernel kill all cgroup processes
automatically).
2019-04-09 11:17:58 +02:00
Davide Cavalca 639dd43a36 core: fix build failure if seccomp is disabled 2019-04-03 13:46:32 +09:00
Lennart Poettering f69567cbe2 core: expose SUID/SGID restriction as new unit setting RestrictSUIDSGID= 2019-04-02 16:56:48 +02:00
Lennart Poettering efebb613c7 core: optionally, trigger .timer units on timezone and clock changes
Fixes: #6228
2019-04-02 08:20:10 +02:00
Lennart Poettering 25a04ae55e core: simply timer expression parsing by using ".ltype" field of conf-parser logic
No change of behaviour. Let's just not parse the lvalue all the time
with timer_base_from_string() if we can already pass it in parsed.
2019-04-01 18:25:43 +02:00
Lennart Poettering a8d08f39d1 core: add new setting NetworkNamespacePath= for configuring a netns by path for a service
Fixes: #2741
2019-03-07 16:55:23 +01:00