This is mostly fall-out from d1a1f0aaf0,
however some cases are older bugs.
There might be more issues lurking, this was a simple grep for "%m"
across the tree, with all lines removed that mention "errno" at all.
Previously the enumerate() callback defined for each unit type would do
two things:
1. It would create perpetual units (i.e. -.slice, system.slice, -.mount and
init.scope)
2. It would enumerate units from /proc/self/mountinfo, /proc/swaps and
the udev database
With this change these two parts are split into two seperate methods:
enumerate() now only does #2, while enumerate_perpetual() is responsible
for #1. Why make this change? Well, perpetual units should have a
slightly different effect that those found through enumeration: as
perpetual units should be up unconditionally, perpetually and thus never
change state, they should also not pull in deps by their state changing,
not even when the state is first set to active. Thus, their state is
generally initialized through the per-device coldplug() method in
similar fashion to the deserialized state from a previous run would be
put into place. OTOH units found through regular enumeration should
result in state changes (and thus pull in deps due to state changes),
hence their state should be put in effect in the catchup() method
instead. Hence, given this difference, let's also separate the
functions, so that the rule is:
1. What is created in enumerate_perpetual() should be started in
coldplug()
2. What is created in enumerate() should be started in catchup().
This reworks how device units are "powered on".
This makes sure that any device changes that might have happened while
we were restarting/reloading will be noticed properly. For that we'll
now properly serialize/deserialize both the device unit state and the
device "found" flags, and restore these initially in the "coldplug"
phase of the manager deserialization. While enumerating the udev devices
during startup we'll put together a new "found" flags mask, which we'll
the switch to in the "catchup" phase of the manager deserialization,
which follows the "coldplug" phase.
Note that during the "coldplug" phase no unit state change events are
generated, which is different for the "catchall" phase which will do
that. Thus we correctly make sure that the deserialized state won't pull
in new deps, but any device's change while we were reloading would.
Fixes: #8832
Replaces: #8675
This is very similar to the existing unit method coldplug() but is
called a bit later. The idea is that that coldplug() restores the unit
state from before any prior reload/restart, i.e. puts the deserialized
state in effect. The catchup() call is then called a bit later, to
catch up with the system state for which we missed notifications while
we were reloading. This is only really useful for mount, swap and device
mount points were we should be careful to generate all missing unit
state change events (i.e. call unit_notify() appropriately) for
everything that happened while we were reloading.
let's drop the "now" argument, it's exactly what MANAGER_IS_RUNNING()
returns, hence let's use that instead to simplify things.
Moreover, let's change the add/found argument pair to become found/mask,
which allows us to change multiple flags at the same time into opposing
directions, which will be useful later on.
Also, let's change the return type to void. It's a notifier call where
callers will ignore the return value anyway as it is nothing actionable.
Should not change behaviour.
It's a bit prettier that day as the function won't silently overwrite
any possibly pre-initialized field, and destroy it right before we
allocate a new event source.
Scope units don't have a main or control process we can watch, hence
let's explicitly watch the PIDs contained in them early on, just to make
things more robust and have at least something to watch.
This reworks how systemd tracks processes on cgroupv1 systems where
cgroup notification is not reliable. Previously, whenever we had reason
to believe that new processes showed up or got removed we'd scan the
cgroup of the scope or service unit for new processes, and would tidy up
the list of PIDs previously watched. This scanning is relatively slow,
and does not scale well. With this change behaviour is changed: instead
of scanning for new/removed processes right away we do this work in a
per-unit deferred event loop job. This event source is scheduled at a
very low priority, so that it is executed when we have time but does not
starve other event sources. This has two benefits: this expensive work is
coalesced, if events happen in quick succession, and we won't delay
SIGCHLD handling for too long.
This patch basically replaces all direct invocation of
unit_watch_all_pids() in scope.c and service.c with invocations of the
new unit_enqueue_rewatch_pids() call which just enqueues a request of
watching/tidying up the PID sets (with one exception: in
scope_enter_signal() and service_enter_signal() we'll still do
unit_watch_all_pids() synchronously first, since we really want to know
all processes we are about to kill so that we can track them properly.
Moreover, all direct invocations of unit_tidy_watch_pids() and
unit_synthesize_cgroup_empty_event() are removed too, when the
unit_enqueue_rewatch_pids() call is invoked, as the queued job will run
those operations too.
All of this is done on cgroupsv1 systems only, and is disabled on
cgroupsv2 systems as cgroup-empty notifications are reliable there, and
we do not need SIGCHLD events to track processes there.
Fixes: #9138
This way we don't need to repeat the argument twice.
I didn't replace all instances. I think it's better to leave out:
- asserts
- comparisons like x & y == x, which are mathematically equivalent, but
here we aren't checking if flags are set, but if the argument fits in the
flags.
Previously, we'd not care about failures that were seen earlier and
remain in "exited" state. This could be triggered if the main process of
a service failed while ExecStartPost= was still running, as in that case
we'd not immediately act on the main process failure because we needed
to wait for ExecStartPost= to finish, before acting on it.
Fixes: #8929
The function is similar to path_kill_slashes() but also removes
initial './', trailing '/.', and '/./' in the path.
When the second argument of path_simplify() is false, then it
behaves as the same as path_kill_slashes(). Hence, this also
replaces path_kill_slashes() with path_simplify().
This adds a flags parameter to unit_notify() which can be used to pass
additional notification information to the function. We the make the old
reload_failure boolean parameter one of these flags, and then add a new
flag that let's unit_notify() if we are configured to restart the
service.
Note that this adjusts behaviour of systemd to match what the docs say.
Fixes: #8398
This corresponds nicely with the specifiers we already pass for
/var/lib, /var/cache, /run and so on.
This is particular useful to update the test-path service files to
operate without guessable files, thus allowing multiple parallel
test-path invocations to pass without issues (the idea is to set $TMPDIR
early on in the test to some private directory, and then only use the
new %T or %V specifier to refer to it).
The symbolic links to private directories specified by StateDirectory=
or its friends are created on the host. So, when DynamicUser= and
RootDirectory=/RootImage= are set, then the executed process cannot
access private directory.
This makes the private directories are mounted on the non-private place
when both DynamicUser= and RootDirectory=/RootImage= are set.
Fixes#8965.
Most our other parsing functions do this, let's do this here too,
internally we accept that anyway. Also, the closely related
load_env_file() and load_env_file_pairs() also do this, so let's be
systematic.
It is not const, because a) systemd can bump it on its own if
errors occur, and b) the user can change it using signals.
Also it's not boolean.
$ busctl get-property org.freedesktop.systemd1 /org/freedesktop/systemd1 org.freedesktop.systemd1.Manager ShowStatus
b true
$ sudo kill -SIGRTMIN+21 1
$ busctl get-property org.freedesktop.systemd1 /org/freedesktop/systemd1 org.freedesktop.systemd1.Manager ShowStatus
b false
Fixes#4503.
When DynamicUser= is set, then RuntimeDirectory= should be always
chowned, as the service unit may enable RuntimeDirectoryPreserve=,
and the uid or gid may changed from the last run.
This also makes easier to migrate the service to use DynamicUser=.
This makes most header files easier to look at. Also Emacs gets really
slow when browsing through large sections of overly long prototypes,
which is much improved by this macro.
We should probably not do something similar with too many other cases,
as macros like this might help readability for some, but make it worse
for others. But I think given the complexity of this specific prototype
and how often we use it, it's worth doing.
When "systemctl daemon-reload" is run at the same time as "systemctl
start foo", the latter might hang. That's because commands like start
wait for JobRemoved signal to know when the job is finished. But if the
job is finished during reloading, the signal is never sent.
The hang can be easily reproduced by running
# for ((N=1; N>0; N++)) ; do echo $N ; systemctl daemon-reload ; done
# for ((N=1; N>0; N++)) ; do echo $N ; systemctl start systemd-coredump.socket ; done
in two different terminals. The start command will hang after 1-2
iterations.
This keeps track of jobs that were started before reload and finished
during it and sends JobRemoved after the reload has finished.
And port config_parse_exec_oom_score_adjust() over to use it.
While we are at it, let's also fix config_parse_exec_oom_score_adjust()
to accept an empty string for turning off OOM score adjustments set
earlier.
That way we can use it in nspawn.
Also, while we are at it, let's rename the call config_parse_rlimit(),
i.e. insert the "r", to clarify what kind of limit this is about.
Commenting out "WatchdogTimeout=3min" in systemd-logind.service causes
NotifyAccess to go from "main" to "none", breaking support for logind
restart. Let's fix that.
We'd write a sequence that was invalid unicode and this caused the d-bus
connection to be terminated:
$ busctl get-property org.freedesktop.systemd1 /org/freedesktop/systemd1/unit/dbus_2esocket org.freedesktop.systemd1.Unit SubState
s "running"
$ busctl get-property org.freedesktop.systemd1 /org/freedesktop/systemd1/unit/dbus_e2socket org.freedesktop.systemd1.Unit SubState
Remote peer disconnected
$ busctl get-property org.freedesktop.systemd1 /org/freedesktop/systemd1/unit/dbus_e2socket org.freedesktop.systemd1.Unit SubState
(hangs)
Fixes#8978.
When I see "test", I have to think three times what the return value
means. With "below" this is immediately clear. ratelimit_below(&limit)
sounds almost like English and is imho immediately obvious.
(I also considered ratelimit_ok, but this strongly implies that being under the
limit is somehow better. Most of the times this is true, but then we use the
ratelimit to detect triple-c-a-d, and "ok" doesn't fit so well there.)
C.f. a1bcaa07.
This adds a number of entries nspawn already applies to regular service
namespacing too. Most importantly let's mask /proc/kcore and
/proc/kallsyms too.
if we lack privs to create device nodes that's fine, and creating
/run/systemd/inaccessible/chr or /run/systemd/inaccessible/blk won't
work then. Document this in longer comments.
Fixes: #4484
Scope units are populated from PIDs specified by the bus client. We do
that when a scope is started. We really shouldn't allow scopes to be
started multiple times, as the PIDs then might be heavily out of date.
Moreover, clients should have the guarantee that any scope they allocate
has a clear runtime cycle which is not repetitive.
Previously we were a bit sloppy with the index and size types of arrays,
we'd regularly use unsigned. While I don't think this ever resulted in
real issues I think we should be more careful there and follow a
stricter regime: unless there's a strong reason not to use size_t for
array sizes and indexes, size_t it should be. Any allocations we do
ultimately will use size_t anyway, and converting forth and back between
unsigned and size_t will always be a source of problems.
Note that on 32bit machines "unsigned" and "size_t" are equivalent, and
on 64bit machines our arrays shouldn't grow that large anyway, and if
they do we have a problem, however that kind of overly large allocation
we have protections for usually, but for overflows we do not have that
so much, hence let's add it.
So yeah, it's a story of the current code being already "good enough",
but I think some extra type hygiene is better.
This patch tries to be comprehensive, but it probably isn't and I missed
a few cases. But I guess we can cover that later as we notice it. Among
smaller fixes, this changes:
1. strv_length()' return type becomes size_t
2. the unit file changes array size becomes size_t
3. DNS answer and query array sizes become size_t
Fixes: https://bugs.freedesktop.org/show_bug.cgi?id=76745
In particular, this confirms that the Found state needs to remain a bit-field:
$ systemd-analyze dump |grep 'Found: '|sort |uniq -c
105 Found: found-udev
3 Found: found-udev,found-mount
1 Found: found-udev,found-swap
(or in terms of the names of the actual files on disk, do not link
libsystemd-shared-238.a into libcore.a).
libsystemd_static is linked into libsystemd_shared, which in turn means that
anything that links to libcore and libsystemd_shared will get libsystemd_static
twice:
$ cc -o systemd 'systemd@exe/src_core_main.c.o' -Wl,--no-undefined -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -pie -DVALGRIND -Wl,--start-group src/core/libcore.a src/shared/libsystemd-shared-238.a src/shared/libsystemd-shared-238.so -pthread -lrt -lseccomp -lselinux -lmount -lblkid -Wl,--end-group -lseccomp -lpam -L/lib64 -laudit -lkmod -lmount -lrt -lcap -lacl -lcryptsetup -lgcrypt -lip4tc -lip6tc -lseccomp -lselinux -lidn -llzma -llz4 -lblkid '-Wl,-rpath,$ORIGIN/src/shared' -Wl,-rpath-link,/home/zbyszek/src/systemd/build/src/shared
This propagation of the dependency seems correct (in the sense that meson is
doing the expected thing based on the given configuration). Linking was done
this way in the original meson conversion. I was probably trying to get
everything to compile and link, I'm not sure why this particular choice was
made. In the meantime, meson has gotten better at propagating dependencies, so
it's possible that this had slightly different effect in the original
conversion, but I did not verify this. Either way, I think we should drop this.
With the patch:
$ cc -o systemd 'systemd@exe/src_core_main.c.o' -Wl,--no-undefined -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -pie -DVALGRIND -Wl,--start-group src/core/libcore.a src/shared/libsystemd-shared-238.so -pthread -lrt -lseccomp -lselinux -lmount -Wl,--end-group -lblkid -lrt -lseccomp -lpam -L/lib64 -laudit -lkmod -lselinux -lmount '-Wl,-rpath,$ORIGIN/src/shared' -Wl,-rpath-link,/home/zbyszek/src/systemd/build/src/shared
This is more correct because we're not linking the same code twice.
With the patch, libystemd_static is used in exactly four places:
- src/shared/libsystemd-shared-238.so
- src/udev/libudev.so.1.6.10
- pam_systemd.so
- test-bus-error
(compared to a bunch more executables before, including systemd,
systemd-analyze, test-hostname, test-ns, etc.)
Size savings are also noticable:
$ size /var/tmp/inst?/usr/lib/systemd/libsystemd-shared-238.so
text data bss dec hex filename
2397826 578488 15920 2992234 2da86a /var/tmp/inst1/usr/lib/systemd/libsystemd-shared-238.so
2397826 578488 15920 2992234 2da86a /var/tmp/inst2/usr/lib/systemd/libsystemd-shared-238.so
$ size /var/tmp/inst?/usr/lib/systemd/systemd
text data bss dec hex filename
1858790 261688 9320 2129798 207f86 /var/tmp/inst1/usr/lib/systemd/systemd
1556358 258704 8072 1823134 1bd19e /var/tmp/inst2/usr/lib/systemd/systemd
$ du -s /var/tmp/inst?
52216 /var/tmp/inst1
50844 /var/tmp/inst2
https://github.com/google/oss-fuzz/issues/1330#issuecomment-384054530 might be related.
This is useful for foreign container runtimes implementing the OCI
runtime spec, which only wants to deal with cgroup paths. There's
already an API to translate units into cgroup paths, with this we add
the reverse.
This drops a good number of type-specific _cleanup_ macros, and patches
all users to just use the generic ones.
In most recent code we abstained from defining type-specific macros, and
this basically removes all those added already, with the exception of
the really low-level ones.
Having explicit macros for this is not too useful, as the expression
without the extra macro is generally just 2ch wider. We should generally
emphesize generic code, unless there are really good reasons for
specific code, hence let's follow this in this case too.
Note that _cleanup_free_ and similar really low-level, libc'ish, Linux
API'ish macros continue to be defined, only the really high-level OO
ones are dropped. From now on this should really be the rule: for really
low-level stuff, such as memory allocation, fd handling and so one, go
ahead and define explicit per-type macros, but for high-level, specific
program code, just use the generic _cleanup_() macro directly, in order
to keep things simple and as readable as possible for the uninitiated.
Note that before this patch some of the APIs (notable libudev ones) were
already used with the high-level macros at some places and with the
generic _cleanup_ macro at others. With this patch we hence unify on the
latter.
I'm not sure if I understand the original code. AFAICS, errno does not
have to be set at all in this callback.
ratelimit_test() returns positive if we are under limit. The code would only
log if the condition happened very often, which I assume is not inteded, and
this check was supposed to prevent too much logging.
Those are quite similar to %i/%I, but refer to the last dash-separated
component of the name prefix.
The new functionality of dash-dropins could largely supersede the template
functionality, so it would be tempting to overload %i/%I. But that would
not be backwards compatible. So let's add the two new letters instead.
Do not try to party initialize a device during deserialization if it's not
known by udev (anymore) and therefore hasn't been seen during device
enumeration.
The device unit in this case has not been initialized properly and setting it
in the "plugged" state can be confusing.
Actually this happens during every boots when PID switches to the new rootfs:
PID is reexecuted and enumerates devices but since udev is not running, the
list of enumerated devices is empty.
PID1 updates the state of device units upon 2 different events:
- when it processes an event sent by udev and in this case the device deps are
started if the device enters in the "plugged" state.
- when it enumerates all devices during its startup or when it is asked to
reload its configuration data but in this case the device deps (if any) are
not retroactively started.
When udev processes a new "add" kernel event, it first registers the new device
in its databases then sends an event to systemd.
If for any reason, systemd is asked to reload its configuration between the
previous 2 steps, it might see for the first time the new device while scanning
/sys for all devices. Only during a second step, udev will send the event for
the new device.
In this peculiar case the device deps wont be started (even though the device
is first seen by PID1).
Indeed when reloading its configurations, PID1 will put the device unit in the
"plugged" state but without starting the device deps. Thereafter PID1 will get
the event from udev for the new device but the device unit will be in "plugged"
state already therefore it won't see any need to start the device dependencies.
Rather than assuming that during the reloading of systemd manager configuration
all devices listed in udev DBs have been already processed and should be put in
the "plugged" state (done by device_coldplug()), this patch does that only for
devices which have been processed via an udev event (device_dispatch_io())
previously. In this case we set "d->found" to "DEVICE_FOUND_UDEV" and we make
also sure to no more initialize "d->found" while enumerating devices. Instead
this field is now saved/restored while devices are serialized.
Double newlines (i.e. one empty lines) are great to structure code. But
let's avoid triple newlines (i.e. two empty lines), quadruple newlines,
quintuple newlines, …, that's just spurious whitespace.
It's an easy way to drop 121 lines of code, and keeps the coding style
of our sources a bit tigther.
We check the same condition at various places. Let's add a trivial,
common helper for this, and use it everywhere.
It's not going to make things much faster or much shorter, but I think a
lot more readable
Before this patch we'd resolve all symlinks of bind mounts and other
mount points to establish for a service in advance, and only then start
mounting them. This is problematic, if symlink chains jump around
between directories in a namespace tree, so that to resolve a specific
symlink chain we need to establish another mount already. A typical case
where this happens is if /etc/resolv.conf is a symlink to some file in
/run: in that case we'd normally resolve and mount /etc/resolv.conf
early on, but that's broken, as to do this properly we'd need to resolve
/etc/resolv.conf first, then figure out that /run needs to be mounted
before we can proceed, and thus reorder the order in which we apply
mounts dynamically.
With this change, whenever we are about to apply a mount, we'll do a
single step of the symlink normalization process, patch the mount entry
accordingly, and then sort the list of mounts to establish again, taking
the new path into account. This means that we can correctly deal with
the example above: we might start with wanting to mount /etc/resolv.conf
early, but after resolving it to the path in /run/ we'd push it to the
end of the list, ensuring that /run is mounted first.
(Note that this also fixes another bug: we were following symlinks on
the bind mount source relative to the root directory of the service,
rather than of the host. That's wrong though as we explicitly document
tha the source of bind mounts is always on the host.)
Absolute paths make everything simple and quick, but sometimes this requirement
can be annoying. A good example is calling 'test', which will be located in
/usr/bin/ or /bin depending on the distro. The need the provide the full path
makes it harder a portable unit file in such cases.
This patch uses a fixed search path (DEFAULT_PATH which was already used as the
default value of $PATH), and if a non-absolute file name is found, it is
immediately resolved to a full path using this search path when the unit is
loaded. After that, everything behaves as if an absolute path was specified. In
particular, the executable must exist when the unit is loaded.
Returning 0 on not-found/wrong-type is confusing. Let's return -ENXIO in that
case instead, and explicitly ignore it in the call site where we want to do that.
I think this is clearer and less likely to be used errenously in case another
call site is added.
C.f. 152c475f95 and 98b1d2b8d9.
In environments where CAP_MKNOD is not available or inside
user namespaces it is still desirable to enable services to use
PrivateDevices= . So fall back to using bind-mounts on EPERM.
Files which are installed as-is (any .service and other unit files, .conf
files, .policy files, etc), are left as is. My assumption is that SPDX
identifiers are not yet that well known, so it's better to retain the
extended header to avoid any doubt.
I also kept any copyright lines. We can probably remove them, but it'd nice to
obtain explicit acks from all involved authors before doing that.