From the bug:
> According to the documentation of systemd.automount if the automoint point is
> automagically created if it doesn't exist yet. This ofcourse means the
> filesystem underneath has to be writable, which for / means not only does
> -.mount need to be started but also systemd-remount-fs.service has to be run,
> which isn't guaranteed by the default automount dependencies.
>
> For .mount units there is an automatic default After= dependency on
> local-fs-pre.target, would probably make sense to do the same for automount
> units to avoid it failing on the corner-case where it has to create directory.
Fixes#13306.
Just as `RuntimeMaxSec=` is supported for service units, add support for
it to scope units. This will gracefully kill a scope after the timeout
expires from the moment the scope enters the running state.
This could be used for time-limited login sessions, for example.
Signed-off-by: Philip Withnall <withnall@endlessm.com>
Fixes: #12035
chase_symlinks() would return negative on error, and either a non-negative status
or a non-negative fd when CHASE_OPEN was given. This made the interface quite
complicated, because dependning on the flags used, we would get two different
"types" of return object. Coverity was always confused by this, and flagged
every use of chase_symlinks() without CHASE_OPEN as a resource leak (because it
would this that an fd is returned). This patch uses a saparate output parameter,
so there is no confusion.
(I think it is OK to have functions which return either an error or an fd. It's
only returning *either* an fd or a non-fd that is confusing.)
This makes it easy to tell that the function only uses the Unit* for
reporting, and only makes changes to the other argument (which most likely
also points at the same Unit structure) for modifications.
No functional change, just adjusting code to follow the same pattern
everywhere. In particular, never call _verify() on an already loaded unit,
but return early from the caller instead. This makes the code a bit easier
to follow.
Now all unit types define .load. But even if it wasn't defined, we'd need
to call unit_load_fragment_and_dropin() anyway, so this code would not have
worked correctly.
Also, unit_load_fragment_and_dropin() either returns -ENOENT or changes
UNIT_STUB to UNIT_LOADED, so we don't need to repeat this here.
There is a slight functional change when load_state == UNIT_MERGED. Before,
we would not call unit_load_dropin(), but now we do. I'm not sure if this
causes an actual difference in behaviour, but since all other unit types do
this, I think it's better to do the same thing here too.
unit_load_fragment_and_dropin() and unit_load_fragment_and_dropin_optional()
are really the same, with one minor difference in behaviour. Let's drop
the second function.
"_optional" in the name suggests that it's the "dropin" part that is optional.
(Which it is, but in this case, we mean the fragment to be optional.)
I think the new version with a flag is easier to understand.
The default message for ENOSPC is very misleading: it says that the disk is
filled, but in fact the inotify watch limit is the problem.
So let's introduce and use a wrapper that simply calls inotify_add_watch(2) and
which fixes the error message up in case ENOSPC is returned.
PID1 may modified the environment passed by the kernel when it starts
running. Commit 9d48671c62 unset $HOME for
example.
In case PID1 is going to switch to a new root and execute a new system manager
which is not systemd, we should restore the original environment as the new
manager might expect some variables to be set by default (more specifically
$HOME).
This is the most basic consumer of the new systemd-vs-kernel checker,
both acting as a reasonable standalone exerciser of the code, and also
as a way for easy inspection of deviations from systemd internal state.
We currently don't have any mitigations against another privileged user
on the system messing with the cgroup hierarchy, bringing the system out
of line with what we've set in systemd. We also don't have any real way
to surface this to the user (we do have logs, but you have to know to
look in the first place).
There are a few possible solutions:
1. Maintaining our own cgroup tree with the new fsopen API and having a
read-only copy for everyone else. However, there are some
complications on this front, and this may be infeasible in some
environments. I'd rate this as a longer term effort that's tangential
to this patch.
2. Actively checking for changes with {fa,i}notify and changing them
back afterwards to match our configuration again. This is also
possible, but it's also good to have a way to do passive monitoring
of the situation without taking hard action. Also, currently daemons
like senpai do actually need to modify the tree behind systemd's
back (although hopefully this should be more integrated soon).
This patch implements another option, where one can, on demand, monitor
deviations in cgroup memory configuration from systemd's internal state.
Currently the only consumer is `systemd-analyze dump`, but the interface
is generic enough that it can also be exposed elsewhere later (for
example, over D-Bus).
Currently only memory limit style properties are supported, but later I
also plan to expand this out to other properties that systemd should
have ultimate control over.
mkdir -p is called both when setting up the autofs mount, as well
as after being notified that the real mount unit should be called.
However the first mkdir -p is hardcoded with 0555, while the second
uses the value specified to DirectoryMode in the automount unit; the
second mkdir -p is only needed when called from coldplug, so under
normal operation the dirs are incorrectly created with mode 0555.
This replaces the hardcoded 0555 mode with the value of DirectoryMode.
Closes#13683.
Setting the log level based on the signal made sense when signals that
were used were fixed. Since we allow signals to be configured, it doesn't
make sense to log at notice level about e.g. a restart or stop operation
just because the signal used is different.
This avoids messages like:
six.service: Killing process 210356 (sleep) with signal SIGINT.
v2:
- if RestartKillSignal= is not specified, fall back to KillSignal=. This is necessary
to preserve backwards compatibility (and keep KillSignal= generally useful).
A later version of the DefaultMemory{Low,Min} patch changed these to
require explicitly setting memory_foo_set, but we only set that in
load-fragment, not dbus-cgroup.
Without these, we may fall back to either DefaultMemoryFoo or
CGROUP_LIMIT_MIN when we really shouldn't.
This is an oversight from https://github.com/systemd/systemd/pull/12332.
Sadly the tests didn't catch it since it requires a real cgroup
hierarchy to see, and it wasn't seen in prod since we're only currently
using DefaultMemoryLow, not DefaultMemoryMin. :-(
As documented in the man-page, readdir() may return a directory entry with
d_type == DT_UNKNOWN. This must be handled for regular filesystems.
dirent_ensure_type() is available to set d_type if necessary. Use it in
some more places.
Without this systemd will fail to boot correctly with nfsroot and some
other filesystems.
Closes#13609
-ELOOP can happen also when enabling an alias name (which is admittedly useless
since the unit it belongs to was already enabled) so let's mention this
possibility when reporting the corresponding error.
Introduce support for configuring cpus and mems for processes using
cgroup v2 CPUSET controller. This allows users to limit which cpus
and memory NUMA nodes can be used by processes to better utilize
system resources.
The cgroup v2 interfaces to control it are cpuset.cpus and cpuset.mems
where the requested configuration is written. However, it doesn't mean
that the requested configuration will be actually used as parent cgroup
may limit the cpus or mems as well. In order to reflect the real
configuration cgroup v2 provides read-only files cpuset.cpus.effective
and cpuset.mems.effective which are exported to users as well.
We have the problem that many early boot or late shutdown issues are harder
to solve than they could be because we have no logs. When journald is not
running, messages are redirected to /dev/kmsg. It is also the time when many
things happen in a rapid succession, so we tend to hit the kernel printk
ratelimit fairly reliably. The end result is that we get no logs from the time
where they would be most useful. Thus let's disable the kernels ratelimit.
Once the system is up and running, the ratelimit is not a problem. But during
normal runtime, things also log to journald, and not to /dev/kmsg, so the
ratelimit is not useful. Hence, there doesn't seem to be much point in trying
to restore the ratelimit after boot is finished and journald is up and running.
See kernel's commit 750afe7babd117daabebf4855da18e4418ea845e for the
description of the kenrel interface. Our setting has lower precedence than
explicit configuration on the kenrel command line.
"ratelimit" is a real word, so we don't need to use the other form anywhere.
We had both forms in various places, let's standarize on the shorter and more
correct one.
The "Ex" variant was originally only added for ExecStartXYZ= but it makes
sense to have feature parity for the rest of the exec command properties
as well (e.g. ExecReload=, ExecStop=, etc).
I want to use efivars.[ch] in proc-cmdline.c, but most of the efivars stuff is
not needed in basic/. Move the file from shared/ to basic/, but then move back
most of the higher-level functions to the new shared/efi-loader.c file.
This way less stuff needs to be in basic. Initially, I wanted to move all the
parts of cgroup-utils.[ch] that depend on efivars.[ch] to shared, because
efivars.[ch] is in shared/. Later on, I decide to split efivars.[ch], so the
move done in this patch is not necessary anymore. Nevertheless, it is still
valid on its own. If at some point we want to expose libbasic, it is better to
to not have stuff that belong in libshared there.
This avoid the use of the global variable.
Also rename cgroup_unified_update() to cgroup_unified_cached() and
cgroup_unified_flush() to cgroup_unified() to better reflect their new roles.
During the rework of unit file loading, commit e8630e6952 dropped the
initialization u->source_mtime. This had the bad side effect that generated
units always needed daemon reloading.
This function had two users (apart from tests), and both only used one
argument. And it seems likely that if we need to pass more directories,
either the _nulstr() or the _strv() form would be used. Let's simplify
the code.
relabel_extra() relabels the descendants of directories listed in
relabel-extra.d, but doesn't relabel the files or directories
explicitly named there. This makes it impossible to use
relabel-extra.d to relabel the root of a filesystem. Fix by
relabeling the named items too.
This fixes the following problem:
> At the very end of the boot, just after the first user logs in
> (usually using sddm / X) I get the following messages in my logs:
> Nov 18 07:02:33 samd dbus-daemon[2879]: [session uid=1000 pid=2877] Activated service 'org.freedesktop.systemd1' failed: Process org.freedesktop.systemd1 exited with status 1
> Nov 18 07:02:33 samd dbus-daemon[2879]: [session uid=1000 pid=2877] Activated service 'org.freedesktop.systemd1' failed: Process org.freedesktop.systemd1 exited with status 1
These messages are caused by the "stub" service files that systemd
installs. It installed them because early versions of systemd activation
required them to exist.
Since dbus 1.11.0, a dbus-daemon that is run with --systemd-activation
automatically assumes that o.fd.systemd1 is an activatable
service. As a result, with a new enough dbus version,
/usr/share/dbus-1/services/org.freedesktop.systemd1.service and
/usr/share/dbus-1/system-services/org.freedesktop.systemd1.service should
become unnecessary, and they can be removed.
dbus 1.11.0 was released 2015-12-02.
Bug-Debian: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=914015
This reverts commit 2de9b9793b.
This check causes regressions, in particular our own units fail. Apparently, it
is enough for the unit to be referenced enough times:
$ journalctl -b -u systemd-ask-password-console.path
Aug 30 12:08:14 krowka systemd[1]: Condition check resulted in Dispatch Password Requests to Console Directory Watch being skipped.
Aug 30 12:08:33 krowka systemd[1]: Condition check resulted in Dispatch Password Requests to Console Directory Watch being skipped.
Aug 30 12:08:33 krowka systemd[1]: Condition check resulted in Dispatch Password Requests to Console Directory Watch being skipped.
Aug 30 12:08:33 krowka systemd[1]: Condition check resulted in Dispatch Password Requests to Console Directory Watch being skipped.
Aug 30 12:08:33 krowka systemd[1]: Condition check resulted in Dispatch Password Requests to Console Directory Watch being skipped.
Aug 30 12:08:33 krowka systemd[1]: systemd-ask-password-console.path: Start request repeated too quickly.
Aug 30 12:08:33 krowka systemd[1]: Failed to start Dispatch Password Requests to Console Directory Watch.
$ journalctl -b -u systemd-firstboot.service
-- Logs begin at Sun 2019-04-21 12:39:21 CEST, end at Fri 2019-08-30 12:23:06 CEST. --
Aug 30 12:08:33 krowka systemd[1]: Condition check resulted in First Boot Wizard being skipped.
Aug 30 12:08:33 krowka systemd[1]: Condition check resulted in First Boot Wizard being skipped.
Aug 30 12:08:33 krowka systemd[1]: Condition check resulted in First Boot Wizard being skipped.
Aug 30 12:08:33 krowka systemd[1]: Condition check resulted in First Boot Wizard being skipped.
Aug 30 12:08:33 krowka systemd[1]: systemd-firstboot.service: Start request repeated too quickly.
Aug 30 12:08:33 krowka systemd[1]: Failed to start First Boot Wizard.
And the same for other units.
Fixes#13434.
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=935829
Traditionally, user logins had a $PATH in which /bin was before /sbin, while
root logins had a $PATH with /sbin first. This allows the tricks that
consolehelper is doing to work. But even if we ignore consolehelper, having the
path in this order might have been used by admins for other purposes, and
keeping the order in user sessions will make it easier the adoption of systemd
user sessions a bit easier.
Fixes#733.
https://bugzilla.redhat.com/show_bug.cgi?id=1744059
OOM handling in manager_default_environment wasn't really correct.
Now the (theorertical) malloc failure in strv_new() is handled.
Please note that this has no effect on:
- systems with merged /bin-/sbin (e.g. arch)
- when there are no binaries that differ between the two locations.
E.g. on my F30 laptop there is exactly one program that is affected:
/usr/bin/setup -> consolehelper.
There is less and less stuff that relies on consolehelper, but there's still
some.
So for "clean" systems this makes no difference, but helps with legacy setups.
$ dnf repoquery --releasever=31 --qf %{name} --whatrequires usermode
anaconda-live
audit-viewer
beesu
chkrootkit
driftnet
drobo-utils-gui
hddtemp
mate-system-log
mock
pure-ftpd
setuptool
subscription-manager
system-config-httpd
system-config-rootpassword
system-switch-java
system-switch-mail
usermode-gtk
vpnc-consoleuser
wifi-radar
xawtv
When we would iterate over the lookup paths for each unit, making the list as
short as possible was important for performance. With the current cache, it
doesn't matter much. Two classes of paths were being removed:
- paths which don't exist in the filesystem
- paths which symlink to a path earlier in the search list
Both of those points cause problems with the caching code:
- if a user creates a directory that didn't exist before and puts units there,
now we will notice the new mtime an properly load the unit. When the path
was removed from list, we wouldn't.
- we now properly detect whether a unit path is on the path or not.
Before, if e.g. /lib/systemd/system, /usr/lib/systemd/systemd were both on
the path, and /lib was a symlink to /usr/lib, the second directory would be
pruned from the path. Then, the code would think that a symlink
/etc/systemd/system/foo.service→/lib/systemd/system/foo.service is an alias,
but /etc/systemd/system/foo.service→/usr/lib/systemd/system/foo.service would
be considered a link (in the systemctl link sense).
Removing the pruning has a slight negative performance impact in case of
usr-merge systems which have systemd compiled with non-usr-merge paths.
Non-usr-merge systems are deprecated, and this impact should be very small, so
I think it's OK. If it turns out to be an issue, the loop in function that
builds the cache could be improved to skip over "duplicate" directories with
same logic that the cache pruning did before. I didn't want to add this,
becuase it complicates the code to improve a corner case.
Fixes#13272.
The alternative would be to recreate the cache, but dropins can be created very
often for transient settings, so updating the cache seems like a much faster
option.
Fixes#13287.
Current kernels with BFQ scheduler do not yet set their IO weight
through "io.weight" but through "io.bfq.weight" (using a slightly
different interface supporting only default weights, not per-device
weights). This commit enables "IOWeight=" to just to that.
This patch may be dropped at some time later.
Github-Link: https://github.com/systemd/systemd/issues/7057
Signed-off-by: Kai Krakow <kai@kaishome.de>
People do have usernames with dots, and it makes them very unhappy that systemd
doesn't like their that. It seems that there is no actual problem with allowing
dots in the username. In particular chown declares ":" as the official
separator, and internally in systemd we never rely on "." as the seperator
between user and group (nor do we call chown directly). Using dots in the name
is probably not a very good idea, but we don't need to care. Debian tools
(adduser) do not allow users with dots to be created.
This patch allows *existing* names with dots to be used in User, Group,
SupplementaryGroups, SocketUser, SocketGroup fields, both in unit files and on
the command line. DynamicUsers and sysusers still follow the strict policy.
user@.service and tmpfiles already allowed arbitrary user names, and this
remains unchanged.
Fixes#12754.
In high load scenarios it is possible for services to be started
before the NameOwnerChanged signal is properly installed.
Emulate a callback by also queuing a GetNameOwner when the match is
installed.
Fixes: #12956
We explicitly create /etc/systemd/user and other parts of the basic directory
tree. I think we should create /etc/systemd/system too. (The alternative would
be to not create those other directories too, but I think it's nice to have
the basic directory structure in place after installation.)
https://bugzilla.redhat.com/show_bug.cgi?id=1737362
v2:
- do not watch mtime of transient and generated dirs
We'd reload the map after every transient unit we created, which we don't
need to do, since we create those units ourselves and know their fragment
path.
This reworks how we load units from disk. Instead of chasing symlinks every
time we are asked to load a unit by name, we slurp all symlinks from disk
and build two hashmaps:
1. from unit name to either alias target, or fragment on disk
(if an alias, we put just the target name in the hashmap, if a fragment
we put an absolute path, so we can distinguish both).
2. from a unit name to all aliases
Reading all this data can be pretty costly (40 ms) on my machine, so we keep it
around for reuse.
The advantage is that we can reliably know what all the aliases of a given unit
are. This means we can reliably load dropins under all names. This fixes#11972.
We now log at LOG_INFO for any unit. Let's vary the log level
a bit, so that for normal short lived-units (less than 1 sec CPU),
we only log if debugging is enabled.
We were passing 1/4th of the size in bytes as argument. So depending
on the size of the array, either we'd only transfer a subset of values,
or we'd get an alignment error.
I opted to embed the Bitmap structure directly in the ExitStatusSet.
This means that memory usage is a bit higher for units which don't define
this setting:
Service changes:
/* size: 2720, cachelines: 43, members: 73 */
/* sum members: 2680, holes: 9, sum holes: 39 */
/* sum bitfield members: 7 bits, bit holes: 1, sum bit holes: 1 bits */
/* last cacheline: 32 bytes */
/* size: 2816, cachelines: 44, members: 73 */
/* sum members: 2776, holes: 9, sum holes: 39 */
/* sum bitfield members: 7 bits, bit holes: 1, sum bit holes: 1 bits */
But this way the code is simpler and we do less pointer chasing.
Factor it out into a helper function which is a bit easier to expand in
future. This introduces no functional changes.
Signed-off-by: Philip Withnall <withnall@endlessm.com>
I was debugging stuff during early boot, and was confused that I never
found the logs for it in kmsg. The reason for that was that /proc is
generally not mounted the first time we do log_open() and hence
log_set_target(LOG_TARGET_KMSG) we do when running as PID 1 had not
effect. A lot later during start-up we call log_open() again where this
is fixed (after the point where we close all remaining fds still open),
but in the meantime no logs every got written to kmsg. This patch fixes
that.
The specific header file is probably not copyrightable anyway, since
it's so trivial, but let's still add the SPDX header line so that a
systematic check for the line does't spit out this header needlessly.
This option is only used on reboot, not on other types of shutdown
modes, so it is misleading.
Keep the old name working for backward compatibility, but remove it
from the documentation.
Rather than always enabling the shutdown WD on kexec, which might be
dangerous in case the kernel driver and/or the hardware implementation
does not reset the wd on kexec, add a new timer, disabled by default,
to let users optionally enable the shutdown WD on kexec separately
from the runtime and reboot ones. Advise in the documentation to
also use the runtime WD in conjunction with it.
Fixes: a637d0f9ec ("core: set shutdown watchdog on kexec too")
We can meaningfully compare jobs for units which have cpu weight or nice set.
But non-exec units those have those set.
Starting non-exec jobs first allows us to get them out of the queue quickly,
and consider more jobs for starting.
If we have service A, and socket B, and service C which is after socket B,
and we want to start both A and C, and C has higher cpu weight, if we get
B out of the way first, we'll know that we can start both A and C, and we'll
start C first.
Also invert the comparisons using CMP() so they are always done left vs. right,
and negate when returning instead.
Follow-up for da8e178296.
At the moment the shutdown watchdog is set only when rebooting.
The set of "things that can go wrong" is not too far off when kexec'ing
and in fact we have a use case where it would be useful - moving to a
new kernel image.
(The interesting bits about the what and why are in a comment in the
patch, please have a look there instead of looking here in the commit
msg).
Fixes: #10872
Our IO handler is only installed for one fd, hence there's no reason to
conditionalize on it again.
Also, split out the draining into a helper function of its own.
Jobs are added to the run queue in random order. This happens because most
jobs are added by iterating over the transaction or dependency hash maps.
As a result, jobs that can be executed at the same time are started in a
different order each time.
On small embedded devices this can cause a measurable jitter for the point
in time when a job starts (~100ms jitter for 10 units that are started in
random order).
This results is a similar jitter for the boot time. This is undesirable in
general and make optimizing the boot time a lot harder.
Also, jobs that should have a higher priority because the unit has a higher
CPU weight might get executed later than others.
Fix this by turning the job run_queue into a Prioq and sort by the
following criteria (use the next if the values are equal):
- CPU weight
- nice level
- unit type
- unit name
The last one is just there for deterministic sorting to avoid any jitter.
It had two users, but it is just a very thin wrapper around
unit_file_find_dropin_paths(), so using it seems more complicated than directly
invoking unit_file_find_dropin_paths() twice.
Commit fb39af4ce4 forgot to restore the default
rlimit values (RLIMIT_NOFILE and RLIMIT_MEMLOCK) while PID1 is reloading.
This patch extracts the code in charge of initializing the default values for
those rlimits in order to create dedicated functions, which take care of their
initialization.
These functions are then called in parse_configuration() so we make sure that
the default values for these rlimits get restored every time PID1 is reloading
its configuration.
let's add [static] where it was missing so far
Drop [static] on parameters that can be NULL.
Add an assert() around parameters that have [static] and can't be NULL
hence.
Add some "const" where it was forgotten.
timer units maintain state on disk (the persistent touch file), hence
let's expose cleaning it up generically with the new cleaning operation
for units.
This is a much simpler implementation as for the service unit type:
instead of forking out a worker process we just remove the touch file
directly. That should be OK since we only need to remove a single
(empty) file, instead of a recursive user-controlled directory tree.
Fixes: #4930
The implementation is pretty straight-foward: when we get a request to
clean some type of resources we fork off a process doing that, and while
it is running we are in the "cleaning" state.
This adds basic infrastructure to implement a "clean" operation for unit
types. This "clean" operation is supposed to remove on-disk resources of
units, and is supposed to be used in a later commit to clean our
RuntimeDirectory=, StateDirectory= and so on of service units.
Later commits will open this up to the bus, and hook up service units
with this.
This also adds a new generic ActiveState called UNIT_MAINTENANCE. It's
supposed to cover all kinds of "maintainance" state of units.
Specifically, this is supposed to cover the "cleaning" operations later
added for service units which might take a bit of time. This high-level,
generic, abstract state is called UNIT_MAINTENANCE instead of the
more specific "UNIT_CLEANING", since I think this should be kept open
for different operations possibly later on that could be nicely subsumed
under this (for example, maybe a recursive chown()ing operation could be
covered by this, and similar).
Most units have just one name, but we'd print it twice:
-> Unit systemd-sysctl.service:
...
Name: systemd-sysctl.service
Let's only print the "main" name once, and call the other names Aliases.
Fixes#12258.
This is enough to reproduce:
$ systemd-run bash -c 'sleep 10' && systemctl daemon-reload
would result in
Current command vanished from the unit file.
We would serialize as:
ExecStart 0 /usr/bin/bash /usr/bin/bash -c sleep 10000
which of course can't work.
Now we serialize as
ExecStart 0 /usr/bin/bash "/usr/bin/bash" "-c" "sleep 10".
$ systemd-analyze dump | head -3
Timestamp firmware: (null)
Timestamp loader: (null)
Timestamp kernel: Mon 2019-07-01 17:21:02 CEST
Since this is a debugging interface, it is OK to change the output format.
The user can infer what "Timestamp firmware: 123.456ms" means.
Whenever I see EXTRACT_QUOTES, I'm always confused whether it means to
leave the quotes in or to take them out. Let's say "unquote", like we
say "cunescape".
We'd skip any whitespace immediately after "=", but then we'd treat whitespace
that is between "|" or "!" and the value as significant. This is rather
confusing, let's ignore it too.
When we are validating a transaction, we take into account declared
ordering between job units. However, since JOB_STOP goes always first
regardless of the ordering constraint between respective units, we may
detect some false cycles in the transaction which would not prevent the
execution though.
Use the same logic in transaction checking as we use for job execution.
In this mode we are not supposed to "interact with the environment", so loading
all units and printing warnings about syntax errors and /var/run usage seems
inappropriate.
The job ordering logic is spread at multiple places of the code, making
it hard to maintain and also a bit to understand. The actual execution
order of two jobs always depends on their types and the ordering
contraint between their units. Extract this logic to a new function
job_compare. The second change is simplification of the order
evaluation, JOB_STOP takes always precedence (as documented), unless two
units are both stopping, then the ordering constraint is taken into
account.
Takes a single /sys/fs/bpf/pinned_prog string as argument, but may be
specified multiple times. An empty assignment resets all previous filters.
Closes https://github.com/systemd/systemd/issues/10227
Make possible to set NUMA allocation policy for manager. Manager's
policy is by default inherited to all forked off processes. However, it
is possible to override the policy on per-service basis. Currently we
support, these policies: default, prefer, bind, interleave, local.
See man 2 set_mempolicy for details on each policy.
Overall NUMA policy actually consists of two parts. Policy itself and
bitmask representing NUMA nodes where is policy effective. Node mask can
be specified using related option, NUMAMask. Default mask can be
overwritten on per-service level.
This is a workaround to make IPAddressDeny=any/IPAddressAllow=any work
for non-root users that have CAP_NET_ADMIN. "any" was chosen since
all or nothing network access is one of the most common use cases for
isolation.
Allocating BPF LPM TRIE maps require CAP_SYS_ADMIN while BPF_PROG_TYPE_CGROUP_SKB
only needs CAP_NET_ADMIN. In the case of IPAddressXYZ="any" we can just
consistently return false/true to avoid allocating the map and limit the user
to having CAP_NET_ADMIN.
prefix_root() is equivalent to path_join() in almost all ways, hence
let's remove it.
There are subtle differences though: prefix_root() will try shorten
multiple "/" before and after the prefix. path_join() doesn't do that.
This means prefix_root() might return a string shorter than both its
inputs combined, while path_join() never does that. I like the
path_join() semantics better, hence I think dropping prefix_root() is
totally OK. In the end the strings generated by both functon should
always be identical in terms of path_equal() if not streq().
This leaves prefix_roota() in place. Ideally we'd have path_joina(), but
I don't think we can reasonably implement that as a macro. or maybe we
can? (if so, sounds like something for a later PR)
Also add in a few missing OOM checks
When part of the cgroup hierarchy cannot be deleted (e.g. because there
are still processes in it), do not exit unit_prune_cgroup early, but
continue so that u->cgroup_realized is reset.
Log the known case of non-empty cgroups at debug level and other errors
at warning level.
Fixes https://github.com/systemd/systemd/issues/12386
Since kernel 5.2 the kernel thankfully returns proper errors when we
write a value out of range to the sysctl. Which however breaks writing
ULONG_MAX to request the maximum value. Hence let's write the new
maximum value instead, LONG_MAX.
/cc @brauner
Fixes: #12803
We'd get a warning on every nspawn invocation:
dev-hugepages.mount: unit configures an IP firewall, but the local system does not support BPF/cgroup firewalling.
(This warning is only shown for the first unit using IP firewalling.)
Before the previous commit, I'd generally get a warning about systemd-udev.service, even though that service is
not started in containers. But are still many other units which that declare a
firewall, which is currently unsupported in containers. Let's stop warning
about this.
The warning is still emitted e.g. if legacy cgroups are used. This is something
that can be configured, so it makes more sense to emit the warning.
There's no need to warn about the firewall when parsing, because the unit might
not be started at all. Let's warn only when we're actually preparing to start
the firewall.
This changes behaviour:
- the warning is printed just once for all unit types, and not once
for normal units and once for transient units.
- on repeat warnings, the message is not printed at all. There's already
detailed debug info from bpf_firewall_compile(), so we don't need to repeat
ourselves.
- when we are not root, let's say precisely that, not "lack of necessary privileges"
and "the local system does not support BPF/cgroup firewalling".
Fixes#12673.
We store the affinity mask in the native endian. However, over D-Bus we
must transfer the mask in little endian byte order.
This is the second part of c367f996f5.
If we had a configuration setting from a configuration file, and it was
removed, we'd still remember the old value, because there's was no mechanism to
"reset" everything, just to assign new values.
Note that the effect of this is limited. For settings that have an "ongoing" effect,
like systemd.confirm_spawn, the new value is simply used. But some settings can only
be set at start.
In particular, CPUAffinity= will be updated if set to a new value, but if
CPUAffinity= is fully removed, it will not be reset, simply because we don't
know what to reset it to. We might have inherited a setting, or we might have
set it ourselves. In principle we could remember the "original" value that was
set when we were executed, but propagate this over reloads and reexecs, but
that would be a lot of work for little gain. So this corner case of removal of
CPUAffinity= is not handled fully, and a reboot is needed to execute the
change. As a work-around, a full mask of CPUAffinity=0-8191 can be specified.
We have settings which may be set on the kernel command line, and also
in /proc/cmdline (for pid1). The settings in /proc/cmdline have higher priority
of course. When a reload was done, we'd reload just the configuration file,
losing the overrides.
So read /proc/cmdline again during reload.
Also, when initially reading the configuration file when program starts,
don't treat any errors as fatal. The configuration done in there doesn't
seem important enough to refuse boot.
This makes the handling of this option match what we do in unit files. I think
consistency is important here. (As it happens, it is the only option in
system.conf that is "non-atomic", i.e. where there's a list of things which can
be split over multiple assignments. All other options are single-valued, so
there's no issue of how to handle multiple assignments.)
The CPU_SET_S api is pretty bad. In particular, it has a parameter for the size
of the array, but operations which take two (CPU_EQUAL_S) or even three arrays
(CPU_{AND,OR,XOR}_S) still take just one size. This means that all arrays must
be of the same size, or buffer overruns will occur. This is exactly what our
code would do, if it received an array of unexpected size over the network.
("Unexpected" here means anything different from what cpu_set_malloc() detects
as the "right" size.)
Let's rework this, and store the size in bytes of the allocated storage area.
The code will now parse any number up to 8191, independently of what the current
kernel supports. This matches the kernel maximum setting for any architecture,
to make things more portable.
Fixes#12605.
Only changing ownership back to root is not enough we also need to
change the access mode, otherwise the user might have set 666 first, and
thus allow everyone access before and after the chown().
I covered the most obvious paths: those where there's a clear problem
with a path specified by the user.
Prints something like this (at error level):
May 21 20:00:01.040418 systemd[125871]: bad-workdir.service: Failed to set up mount namespacing: /run/systemd/unit-root/etc/tomcat9/Catalina: No such file or directory
May 21 20:00:01.040456 systemd[125871]: bad-workdir.service: Failed at step NAMESPACE spawning /bin/true: No such file or directory
Fixes#10972.
The functions to retrieve and print process cmdlines were based on the
assumption that they contain printable ASCII, and everything else
should be filtered out. That assumption doesn't hold in today's world,
where people are free to use unicode everywhere.
This replaces the custom cmdline reading code with a more generic approach
using utf8_escape_non_printable_full().
For kernel threads, truncation is done on the parenthesized name, so we'll
get "[worker]", "[worker…]", …, "[w…]", "[…", "…" as we reduce the number of
available columns.
This implementation is most likely slower for very long cmdlines, but I don't
think this is very important. The common case is to have short commandlines,
and should print those properly. Absurdly long cmdlines are the exception,
which needs to be handled correctly and safely, but speed is not too important.
Fixes#12532.
v2:
- use size_t for the number of columns. This change propagates into various
other functions that call get_process_cmdline(), increasing the size of the
patch, but the changes are rather trivial.
These make sense to be explicitly set at 0 (which has a different effect
than the default, since it can affect processing of `DefaultMemoryXXX`).
Without this, it's not easily possible to relinquish memory protection
for a subtree, which is not great.
Coverity doesn't like the fact that unit_get_cgroup_context() returns NULL for
unit types that don't have a CGroupContext. We don't expect to call those
functions with such unit types, so this isn't an immediate problem, but we can
make things more robust by handling this case.
CID #1400683, #1400684.
A service might be able to detect errors by itself that may require the
system to take the same action as if the service locked up. Add a
WATCHDOG=trigger state change notification to sd_notify() to let the
service manager know about the self-detected misery and instantly
trigger the configured watchdog behaviour.
This wraps a few common steps. It is defined as inline function instead of in a
.c file to avoid having a .c file. With a .c file, we would have three choices:
- either link it into libshared, but then then libshared would have to be
linked to libmount.
- or compile the .c file into each target separately. This has the disdvantage
that configuration of every target has to be updated and stuff will be compiled
multiple times anyway, which is not too different from keeping this in the
header file.
- or create a new convenience library just for this. This also has the disadvantage
that the every target would have to be updated, and a separate library for a
10 line function seems overkill.
By keeping everything in a header file, we compile this a few times, but
otherwise it's the least painful option. The compiler can optimize most of the
function away, because it knows if 'source' is set or not.
It seems better to use just a single parsing algorithm for /proc/self/mountinfo.
Also, unify the naming of variables in all places that use mnt_table_next_fs().
It makes it easier to compare the different call sites.
When shooting down a service with SIGABRT the user might want to have a
much longer stop timeout than on regular stops/shutdowns. Especially in
the face of short stop timeouts the time might not be sufficient to
write huge core dumps before the service is killed.
This commit adds a dedicated (Default)TimeoutAbortSec= timer that is
used when stopping a service via SIGABRT. In all other cases the
existing TimeoutStopSec= is used. The timer value is unset by default
to skip the special handling and use TimeoutStopSec= for state
'stop-watchdog' to keep the old behaviour.
If the service is in state 'stop-watchdog' and the service should be
stopped explicitly we still go to 'stop-sigterm' and re-apply the usual
TimeoutStopSec= timeout.
In cgroup v2 we have protection tunables -- currently MemoryLow and
MemoryMin (there will be more in future for other resources, too). The
design of these protection tunables requires not only intermediate
cgroups to propagate protections, but also the units at the leaf of that
resource's operation to accept it (by setting MemoryLow or MemoryMin).
This makes sense from an low-level API design perspective, but it's a
good idea to also have a higher-level abstraction that can, by default,
propagate these resources to children recursively. In this patch, this
happens by having descendants set memory.low to N if their ancestor has
DefaultMemoryLow=N -- assuming they don't set a separate MemoryLow
value.
Any affected unit can opt out of this propagation by manually setting
`MemoryLow` to some value in its unit configuration. A unit can also
stop further propagation by setting `DefaultMemoryLow=` with no
argument. This removes further propagation in the subtree, but has no
effect on the unit itself (for that, use `MemoryLow=0`).
Our use case in production is simplifying the configuration of machines
which heavily rely on memory protection tunables, but currently require
tweaking a huge number of unit files to make that a reality. This
directive makes that significantly less fragile, and decreases the risk
of misconfiguration.
After this patch is merged, I will implement DefaultMemoryMin= using the
same principles.
This was the last kind of accounting still not exposed on for each unit.
Let's fix that.
Note that this is a relatively simplistic approach: we don't expose
per-device stats, but sum them all up, much like cgtop does. This kind
of metric is probably the most interesting for most usecases, and covers
the "systemctl status" output best. If we want per-device stats one day
we can of course always add that eventually.
It's a simple wrapper for resetting both IP and CPU accounting in one
go.
This will become particularly useful when we also needs this to reset IO
accounting (to be added in a later commit).