It's a bit prettier that day as the function won't silently overwrite
any possibly pre-initialized field, and destroy it right before we
allocate a new event source.
Scope units don't have a main or control process we can watch, hence
let's explicitly watch the PIDs contained in them early on, just to make
things more robust and have at least something to watch.
This reworks how systemd tracks processes on cgroupv1 systems where
cgroup notification is not reliable. Previously, whenever we had reason
to believe that new processes showed up or got removed we'd scan the
cgroup of the scope or service unit for new processes, and would tidy up
the list of PIDs previously watched. This scanning is relatively slow,
and does not scale well. With this change behaviour is changed: instead
of scanning for new/removed processes right away we do this work in a
per-unit deferred event loop job. This event source is scheduled at a
very low priority, so that it is executed when we have time but does not
starve other event sources. This has two benefits: this expensive work is
coalesced, if events happen in quick succession, and we won't delay
SIGCHLD handling for too long.
This patch basically replaces all direct invocation of
unit_watch_all_pids() in scope.c and service.c with invocations of the
new unit_enqueue_rewatch_pids() call which just enqueues a request of
watching/tidying up the PID sets (with one exception: in
scope_enter_signal() and service_enter_signal() we'll still do
unit_watch_all_pids() synchronously first, since we really want to know
all processes we are about to kill so that we can track them properly.
Moreover, all direct invocations of unit_tidy_watch_pids() and
unit_synthesize_cgroup_empty_event() are removed too, when the
unit_enqueue_rewatch_pids() call is invoked, as the queued job will run
those operations too.
All of this is done on cgroupsv1 systems only, and is disabled on
cgroupsv2 systems as cgroup-empty notifications are reliable there, and
we do not need SIGCHLD events to track processes there.
Fixes: #9138
This way we don't need to repeat the argument twice.
I didn't replace all instances. I think it's better to leave out:
- asserts
- comparisons like x & y == x, which are mathematically equivalent, but
here we aren't checking if flags are set, but if the argument fits in the
flags.
Previously, we'd not care about failures that were seen earlier and
remain in "exited" state. This could be triggered if the main process of
a service failed while ExecStartPost= was still running, as in that case
we'd not immediately act on the main process failure because we needed
to wait for ExecStartPost= to finish, before acting on it.
Fixes: #8929
The function is similar to path_kill_slashes() but also removes
initial './', trailing '/.', and '/./' in the path.
When the second argument of path_simplify() is false, then it
behaves as the same as path_kill_slashes(). Hence, this also
replaces path_kill_slashes() with path_simplify().
This adds a flags parameter to unit_notify() which can be used to pass
additional notification information to the function. We the make the old
reload_failure boolean parameter one of these flags, and then add a new
flag that let's unit_notify() if we are configured to restart the
service.
Note that this adjusts behaviour of systemd to match what the docs say.
Fixes: #8398
This corresponds nicely with the specifiers we already pass for
/var/lib, /var/cache, /run and so on.
This is particular useful to update the test-path service files to
operate without guessable files, thus allowing multiple parallel
test-path invocations to pass without issues (the idea is to set $TMPDIR
early on in the test to some private directory, and then only use the
new %T or %V specifier to refer to it).
The symbolic links to private directories specified by StateDirectory=
or its friends are created on the host. So, when DynamicUser= and
RootDirectory=/RootImage= are set, then the executed process cannot
access private directory.
This makes the private directories are mounted on the non-private place
when both DynamicUser= and RootDirectory=/RootImage= are set.
Fixes#8965.
Most our other parsing functions do this, let's do this here too,
internally we accept that anyway. Also, the closely related
load_env_file() and load_env_file_pairs() also do this, so let's be
systematic.
It is not const, because a) systemd can bump it on its own if
errors occur, and b) the user can change it using signals.
Also it's not boolean.
$ busctl get-property org.freedesktop.systemd1 /org/freedesktop/systemd1 org.freedesktop.systemd1.Manager ShowStatus
b true
$ sudo kill -SIGRTMIN+21 1
$ busctl get-property org.freedesktop.systemd1 /org/freedesktop/systemd1 org.freedesktop.systemd1.Manager ShowStatus
b false
Fixes#4503.
When DynamicUser= is set, then RuntimeDirectory= should be always
chowned, as the service unit may enable RuntimeDirectoryPreserve=,
and the uid or gid may changed from the last run.
This also makes easier to migrate the service to use DynamicUser=.
This makes most header files easier to look at. Also Emacs gets really
slow when browsing through large sections of overly long prototypes,
which is much improved by this macro.
We should probably not do something similar with too many other cases,
as macros like this might help readability for some, but make it worse
for others. But I think given the complexity of this specific prototype
and how often we use it, it's worth doing.
When "systemctl daemon-reload" is run at the same time as "systemctl
start foo", the latter might hang. That's because commands like start
wait for JobRemoved signal to know when the job is finished. But if the
job is finished during reloading, the signal is never sent.
The hang can be easily reproduced by running
# for ((N=1; N>0; N++)) ; do echo $N ; systemctl daemon-reload ; done
# for ((N=1; N>0; N++)) ; do echo $N ; systemctl start systemd-coredump.socket ; done
in two different terminals. The start command will hang after 1-2
iterations.
This keeps track of jobs that were started before reload and finished
during it and sends JobRemoved after the reload has finished.
And port config_parse_exec_oom_score_adjust() over to use it.
While we are at it, let's also fix config_parse_exec_oom_score_adjust()
to accept an empty string for turning off OOM score adjustments set
earlier.
That way we can use it in nspawn.
Also, while we are at it, let's rename the call config_parse_rlimit(),
i.e. insert the "r", to clarify what kind of limit this is about.
Commenting out "WatchdogTimeout=3min" in systemd-logind.service causes
NotifyAccess to go from "main" to "none", breaking support for logind
restart. Let's fix that.
We'd write a sequence that was invalid unicode and this caused the d-bus
connection to be terminated:
$ busctl get-property org.freedesktop.systemd1 /org/freedesktop/systemd1/unit/dbus_2esocket org.freedesktop.systemd1.Unit SubState
s "running"
$ busctl get-property org.freedesktop.systemd1 /org/freedesktop/systemd1/unit/dbus_e2socket org.freedesktop.systemd1.Unit SubState
Remote peer disconnected
$ busctl get-property org.freedesktop.systemd1 /org/freedesktop/systemd1/unit/dbus_e2socket org.freedesktop.systemd1.Unit SubState
(hangs)
Fixes#8978.
When I see "test", I have to think three times what the return value
means. With "below" this is immediately clear. ratelimit_below(&limit)
sounds almost like English and is imho immediately obvious.
(I also considered ratelimit_ok, but this strongly implies that being under the
limit is somehow better. Most of the times this is true, but then we use the
ratelimit to detect triple-c-a-d, and "ok" doesn't fit so well there.)
C.f. a1bcaa07.
This adds a number of entries nspawn already applies to regular service
namespacing too. Most importantly let's mask /proc/kcore and
/proc/kallsyms too.
if we lack privs to create device nodes that's fine, and creating
/run/systemd/inaccessible/chr or /run/systemd/inaccessible/blk won't
work then. Document this in longer comments.
Fixes: #4484
Scope units are populated from PIDs specified by the bus client. We do
that when a scope is started. We really shouldn't allow scopes to be
started multiple times, as the PIDs then might be heavily out of date.
Moreover, clients should have the guarantee that any scope they allocate
has a clear runtime cycle which is not repetitive.
Previously we were a bit sloppy with the index and size types of arrays,
we'd regularly use unsigned. While I don't think this ever resulted in
real issues I think we should be more careful there and follow a
stricter regime: unless there's a strong reason not to use size_t for
array sizes and indexes, size_t it should be. Any allocations we do
ultimately will use size_t anyway, and converting forth and back between
unsigned and size_t will always be a source of problems.
Note that on 32bit machines "unsigned" and "size_t" are equivalent, and
on 64bit machines our arrays shouldn't grow that large anyway, and if
they do we have a problem, however that kind of overly large allocation
we have protections for usually, but for overflows we do not have that
so much, hence let's add it.
So yeah, it's a story of the current code being already "good enough",
but I think some extra type hygiene is better.
This patch tries to be comprehensive, but it probably isn't and I missed
a few cases. But I guess we can cover that later as we notice it. Among
smaller fixes, this changes:
1. strv_length()' return type becomes size_t
2. the unit file changes array size becomes size_t
3. DNS answer and query array sizes become size_t
Fixes: https://bugs.freedesktop.org/show_bug.cgi?id=76745
In particular, this confirms that the Found state needs to remain a bit-field:
$ systemd-analyze dump |grep 'Found: '|sort |uniq -c
105 Found: found-udev
3 Found: found-udev,found-mount
1 Found: found-udev,found-swap
(or in terms of the names of the actual files on disk, do not link
libsystemd-shared-238.a into libcore.a).
libsystemd_static is linked into libsystemd_shared, which in turn means that
anything that links to libcore and libsystemd_shared will get libsystemd_static
twice:
$ cc -o systemd 'systemd@exe/src_core_main.c.o' -Wl,--no-undefined -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -pie -DVALGRIND -Wl,--start-group src/core/libcore.a src/shared/libsystemd-shared-238.a src/shared/libsystemd-shared-238.so -pthread -lrt -lseccomp -lselinux -lmount -lblkid -Wl,--end-group -lseccomp -lpam -L/lib64 -laudit -lkmod -lmount -lrt -lcap -lacl -lcryptsetup -lgcrypt -lip4tc -lip6tc -lseccomp -lselinux -lidn -llzma -llz4 -lblkid '-Wl,-rpath,$ORIGIN/src/shared' -Wl,-rpath-link,/home/zbyszek/src/systemd/build/src/shared
This propagation of the dependency seems correct (in the sense that meson is
doing the expected thing based on the given configuration). Linking was done
this way in the original meson conversion. I was probably trying to get
everything to compile and link, I'm not sure why this particular choice was
made. In the meantime, meson has gotten better at propagating dependencies, so
it's possible that this had slightly different effect in the original
conversion, but I did not verify this. Either way, I think we should drop this.
With the patch:
$ cc -o systemd 'systemd@exe/src_core_main.c.o' -Wl,--no-undefined -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -pie -DVALGRIND -Wl,--start-group src/core/libcore.a src/shared/libsystemd-shared-238.so -pthread -lrt -lseccomp -lselinux -lmount -Wl,--end-group -lblkid -lrt -lseccomp -lpam -L/lib64 -laudit -lkmod -lselinux -lmount '-Wl,-rpath,$ORIGIN/src/shared' -Wl,-rpath-link,/home/zbyszek/src/systemd/build/src/shared
This is more correct because we're not linking the same code twice.
With the patch, libystemd_static is used in exactly four places:
- src/shared/libsystemd-shared-238.so
- src/udev/libudev.so.1.6.10
- pam_systemd.so
- test-bus-error
(compared to a bunch more executables before, including systemd,
systemd-analyze, test-hostname, test-ns, etc.)
Size savings are also noticable:
$ size /var/tmp/inst?/usr/lib/systemd/libsystemd-shared-238.so
text data bss dec hex filename
2397826 578488 15920 2992234 2da86a /var/tmp/inst1/usr/lib/systemd/libsystemd-shared-238.so
2397826 578488 15920 2992234 2da86a /var/tmp/inst2/usr/lib/systemd/libsystemd-shared-238.so
$ size /var/tmp/inst?/usr/lib/systemd/systemd
text data bss dec hex filename
1858790 261688 9320 2129798 207f86 /var/tmp/inst1/usr/lib/systemd/systemd
1556358 258704 8072 1823134 1bd19e /var/tmp/inst2/usr/lib/systemd/systemd
$ du -s /var/tmp/inst?
52216 /var/tmp/inst1
50844 /var/tmp/inst2
https://github.com/google/oss-fuzz/issues/1330#issuecomment-384054530 might be related.
This is useful for foreign container runtimes implementing the OCI
runtime spec, which only wants to deal with cgroup paths. There's
already an API to translate units into cgroup paths, with this we add
the reverse.
This drops a good number of type-specific _cleanup_ macros, and patches
all users to just use the generic ones.
In most recent code we abstained from defining type-specific macros, and
this basically removes all those added already, with the exception of
the really low-level ones.
Having explicit macros for this is not too useful, as the expression
without the extra macro is generally just 2ch wider. We should generally
emphesize generic code, unless there are really good reasons for
specific code, hence let's follow this in this case too.
Note that _cleanup_free_ and similar really low-level, libc'ish, Linux
API'ish macros continue to be defined, only the really high-level OO
ones are dropped. From now on this should really be the rule: for really
low-level stuff, such as memory allocation, fd handling and so one, go
ahead and define explicit per-type macros, but for high-level, specific
program code, just use the generic _cleanup_() macro directly, in order
to keep things simple and as readable as possible for the uninitiated.
Note that before this patch some of the APIs (notable libudev ones) were
already used with the high-level macros at some places and with the
generic _cleanup_ macro at others. With this patch we hence unify on the
latter.
I'm not sure if I understand the original code. AFAICS, errno does not
have to be set at all in this callback.
ratelimit_test() returns positive if we are under limit. The code would only
log if the condition happened very often, which I assume is not inteded, and
this check was supposed to prevent too much logging.
Those are quite similar to %i/%I, but refer to the last dash-separated
component of the name prefix.
The new functionality of dash-dropins could largely supersede the template
functionality, so it would be tempting to overload %i/%I. But that would
not be backwards compatible. So let's add the two new letters instead.
Do not try to party initialize a device during deserialization if it's not
known by udev (anymore) and therefore hasn't been seen during device
enumeration.
The device unit in this case has not been initialized properly and setting it
in the "plugged" state can be confusing.
Actually this happens during every boots when PID switches to the new rootfs:
PID is reexecuted and enumerates devices but since udev is not running, the
list of enumerated devices is empty.
PID1 updates the state of device units upon 2 different events:
- when it processes an event sent by udev and in this case the device deps are
started if the device enters in the "plugged" state.
- when it enumerates all devices during its startup or when it is asked to
reload its configuration data but in this case the device deps (if any) are
not retroactively started.
When udev processes a new "add" kernel event, it first registers the new device
in its databases then sends an event to systemd.
If for any reason, systemd is asked to reload its configuration between the
previous 2 steps, it might see for the first time the new device while scanning
/sys for all devices. Only during a second step, udev will send the event for
the new device.
In this peculiar case the device deps wont be started (even though the device
is first seen by PID1).
Indeed when reloading its configurations, PID1 will put the device unit in the
"plugged" state but without starting the device deps. Thereafter PID1 will get
the event from udev for the new device but the device unit will be in "plugged"
state already therefore it won't see any need to start the device dependencies.
Rather than assuming that during the reloading of systemd manager configuration
all devices listed in udev DBs have been already processed and should be put in
the "plugged" state (done by device_coldplug()), this patch does that only for
devices which have been processed via an udev event (device_dispatch_io())
previously. In this case we set "d->found" to "DEVICE_FOUND_UDEV" and we make
also sure to no more initialize "d->found" while enumerating devices. Instead
this field is now saved/restored while devices are serialized.
Double newlines (i.e. one empty lines) are great to structure code. But
let's avoid triple newlines (i.e. two empty lines), quadruple newlines,
quintuple newlines, …, that's just spurious whitespace.
It's an easy way to drop 121 lines of code, and keeps the coding style
of our sources a bit tigther.
We check the same condition at various places. Let's add a trivial,
common helper for this, and use it everywhere.
It's not going to make things much faster or much shorter, but I think a
lot more readable
Before this patch we'd resolve all symlinks of bind mounts and other
mount points to establish for a service in advance, and only then start
mounting them. This is problematic, if symlink chains jump around
between directories in a namespace tree, so that to resolve a specific
symlink chain we need to establish another mount already. A typical case
where this happens is if /etc/resolv.conf is a symlink to some file in
/run: in that case we'd normally resolve and mount /etc/resolv.conf
early on, but that's broken, as to do this properly we'd need to resolve
/etc/resolv.conf first, then figure out that /run needs to be mounted
before we can proceed, and thus reorder the order in which we apply
mounts dynamically.
With this change, whenever we are about to apply a mount, we'll do a
single step of the symlink normalization process, patch the mount entry
accordingly, and then sort the list of mounts to establish again, taking
the new path into account. This means that we can correctly deal with
the example above: we might start with wanting to mount /etc/resolv.conf
early, but after resolving it to the path in /run/ we'd push it to the
end of the list, ensuring that /run is mounted first.
(Note that this also fixes another bug: we were following symlinks on
the bind mount source relative to the root directory of the service,
rather than of the host. That's wrong though as we explicitly document
tha the source of bind mounts is always on the host.)
Absolute paths make everything simple and quick, but sometimes this requirement
can be annoying. A good example is calling 'test', which will be located in
/usr/bin/ or /bin depending on the distro. The need the provide the full path
makes it harder a portable unit file in such cases.
This patch uses a fixed search path (DEFAULT_PATH which was already used as the
default value of $PATH), and if a non-absolute file name is found, it is
immediately resolved to a full path using this search path when the unit is
loaded. After that, everything behaves as if an absolute path was specified. In
particular, the executable must exist when the unit is loaded.
Returning 0 on not-found/wrong-type is confusing. Let's return -ENXIO in that
case instead, and explicitly ignore it in the call site where we want to do that.
I think this is clearer and less likely to be used errenously in case another
call site is added.
C.f. 152c475f95 and 98b1d2b8d9.
In environments where CAP_MKNOD is not available or inside
user namespaces it is still desirable to enable services to use
PrivateDevices= . So fall back to using bind-mounts on EPERM.
Files which are installed as-is (any .service and other unit files, .conf
files, .policy files, etc), are left as is. My assumption is that SPDX
identifiers are not yet that well known, so it's better to retain the
extended header to avoid any doubt.
I also kept any copyright lines. We can probably remove them, but it'd nice to
obtain explicit acks from all involved authors before doing that.
sd_bus_open/sd_bus_open_system/sd_bus_open_user are convenient, but
don't allow the description to be set. After they return, the bus is
is already started, and sd_bus_set_description() fails with -EBUSY.
It would be possible to allow sd_bus_set_description() to update the
description "live", but messages are already emitted from sd_bus_open
functions, so it's better to allow the description to be set in
sd_bus_open/sd_bus_open_system/sd_bus_open_user.
Fixes message like:
Bus n/a: changing state UNSET → OPENING
The initial fix for relabelling the cgroup filesystem for
SELinux delivered in commit 8739f23e3 was based on the assumption that
the cgroup filesystem is already populated once mount_setup() is
executed, which was true for my system. What I wasn't aware is that this
is the case only when another instance of systemd was running before
this one, which can happen if systemd is used in the initrd (for ex. by
dracut).
In case of a clean systemd start-up the cgroup filesystem is actually
being populated after mount_setup() and does not need relabelling as at
that moment the SELinux policy is already loaded. Since however the root
cgroup filesystem was remounted read-only in the meantime this operation
will now fail.
To fix this check for the filesystem mount flags before relabelling and
only remount ro->rw->ro if necessary and leave the filesystem read-write
otherwise.
Fixes#7901.
Re-use the hacks used to link user keyring, when creating the session
keyring. This way changing ownership of the keyring is not required, and thus
incovation_id can be correctly created in restricted environments.
Creating invocation_id with root permissions works and linking it into session
keyring works, as at that point session keyring is possessed.
Simple way to validate this is with following commands:
$ journalctl -f &
$ sudo systemd-run --uid 1000 /bin/sh -c 'keyctl describe @s; keyctl list @s; keyctl read `keyctl search @s user invocation_id`'
which now works in LXD containers as well as on the host.
Fixes: https://github.com/systemd/systemd/issues/7655
This reworks the SELinux and SMACK label fixing calls in a number of
ways:
1. The two separate boolean arguments of these functions are converted
into a flags type LabelFixFlags.
2. The operations are now implemented based on O_PATH. This should
resolve TTOCTTOU races between determining the label for the file
system object and applying it, as it it allows to pin the object
while we are operating on it.
3. When changing a label fails we'll query the label previously set, and
if matches what we want to set anyway we'll suppress the error.
Also, all calls to label_fix() are now (void)ified, when we ignore the
return values.
Fixes: #8566
linux/fs.h sys/mount.h, libmount.h and missing.h all include MS_*
definitions.
To avoid problems, only one of linux/fs.h, sys/mount.h and libmount.h
should be included. And missing.h must be included last.
Without this, building systemd may fail with:
In file included from [...]/libmount/libmount.h:31:0,
from ../systemd-238/src/core/manager.h:23,
from ../systemd-238/src/core/emergency-action.h:37,
from ../systemd-238/src/core/unit.h:34,
from ../systemd-238/src/core/dbus-timer.h:25,
from ../systemd-238/src/core/timer.c:26:
[...]/sys/mount.h:57:2: error: expected identifier before numeric constant
Currently we add target dependencies while we are loading units. This
can create ordering loops even if configuration doesn't contain any
loop. Take for example following configuration,
$ systemctl get-default
multi-user.target
$ cat /etc/systemd/system/test.service
[Unit]
After=default.target
[Service]
ExecStart=/bin/true
[Install]
WantedBy=multi-user.target
If we encounter such unit file early during manager start-up (e.g. load
queue is dispatched while enumerating devices due to SYSTEMD_WANTS in
udev rules) we would add stub unit default.target and we order it Before
test.service. At the same time we add implicit Before to
multi-user.target. Later we merge two units and we create ordering cycle
in the process.
To fix the issue we will now never add any target dependencies until we
loaded all the unit files and resolved all the aliases.
When we are attempting to create directory somewhere in the bowels of /var/lib
and get an error that it already exists, it can be quite hard to diagnose what
is wrong (especially for a user who is not aware that the directory must have
the specified owner, and permissions not looser than what was requested). Let's
print a warning in most cases. A warning is appropriate, because such state is
usually a sign of borked installation and needs to be resolved by the adminstrator.
$ build/test-fs-util
Path "/tmp/test-readlink_and_make_absolute" already exists and is not a directory, refusing.
(or)
Directory "/tmp/test-readlink_and_make_absolute" already exists, but has mode 0775 that is too permissive (0755 was requested), refusing.
(or)
Directory "/tmp/test-readlink_and_make_absolute" already exists, but is owned by 1001:1000 (1000:1000 was requested), refusing.
Assertion 'mkdir_safe(tempdir, 0755, getuid(), getgid(), MKDIR_WARN_MODE) >= 0' failed at ../src/test/test-fs-util.c:320, function test_readlink_and_make_absolute(). Aborting.
No functional change except for the new log lines.