This new setting is supposed to be useful in most cases where
"MountFlags=slave" is currently used, i.e. as an explicit way to run a
service in its own mount namespace and decouple propagation from all
mounts of the new mount namespace towards the host.
The effect of MountFlags=slave and PrivateMounts=yes is mostly the same,
as both cause a CLONE_NEWNS namespace to be opened, and both will result
in all mounts within it to be mounted MS_SLAVE. The difference is mostly
on the conceptual/philosophical level: configuring the propagation mode
is nothing people should have to think about, in particular as the
matter is not precisely easyto grok. Moreover, MountFlags= allows configuration
of "private" and "slave" modes which don't really make much sense to use
in real-life and are quite confusing. In particular PrivateMounts=private means
mounts made on the host stay pinned for good by the service which is
particularly nasty for removable media mount. And PrivateMounts=shared
is in most ways a NOP when used a alone...
The main technical difference between setting only MountFlags=slave or
only PrivateMounts=yes in a unit file is that the former remounts all
mounts to MS_SLAVE and leaves them there, while that latter remounts
them to MS_SHARED again right after. The latter is generally a nicer
approach, since it disables propagation, while MS_SHARED is afterwards
in effect, which is really nice as that means further namespacing down
the tree will get MS_SHARED logic by default and we unify how
applications see our mounts as we always pass them as MS_SHARED
regardless whether any mount namespacing is used or not.
The effect of PrivateMounts=yes was implied already by all the other
mount namespacing options. With this new option we add an explicit knob
for it, to request it without any other option used as well.
See: #4393
This drops needless safety checks that ensure we only reference block
devices for blockio/io settings. The backing code was already able to
accept regular file system paths too, in which case the backing device
node of that file system would be used. Hence, let's drop the artificial
restrictions and open up this underlying functionality.
Since bb28e68477 parsing failures of
certain unit file settings will result in load failures of units. This
introduces a new load state "bad-setting" that is entered in precisely
this case.
With this addition error messages on bad settings should be a lot more
explicit, as we don't have to show some generic "errno" error in that
case, but can explicitly say that a bad setting is at fault.
Internally this unit load state is entered as soon as any configuration
loader call returns ENOEXEC. Hence: config parser calls should return
ENOEXEC now for such essential unit file settings. Turns out, they
generally already do.
Fixes: #9107
The load_error is only valid in some load_state cases, lets generate
prettier messages for other cases too, by reusing the
bus_unit_validate_load_state() call which does jus that.
Clients (such as systemctl) ignored LoadError unles LoadState was
"error" before. With this change they could even show LoadError in other
cases and it would show a useful name.
Let's use a switch() statement, cover more cases with pretty messages.
Also let's rename it to "validate", as that's more specific that
"check", as it implies checking for a "valid"/"good" state, which is
what this function does.
This is mostly fall-out from d1a1f0aaf0,
however some cases are older bugs.
There might be more issues lurking, this was a simple grep for "%m"
across the tree, with all lines removed that mention "errno" at all.
Previously the enumerate() callback defined for each unit type would do
two things:
1. It would create perpetual units (i.e. -.slice, system.slice, -.mount and
init.scope)
2. It would enumerate units from /proc/self/mountinfo, /proc/swaps and
the udev database
With this change these two parts are split into two seperate methods:
enumerate() now only does #2, while enumerate_perpetual() is responsible
for #1. Why make this change? Well, perpetual units should have a
slightly different effect that those found through enumeration: as
perpetual units should be up unconditionally, perpetually and thus never
change state, they should also not pull in deps by their state changing,
not even when the state is first set to active. Thus, their state is
generally initialized through the per-device coldplug() method in
similar fashion to the deserialized state from a previous run would be
put into place. OTOH units found through regular enumeration should
result in state changes (and thus pull in deps due to state changes),
hence their state should be put in effect in the catchup() method
instead. Hence, given this difference, let's also separate the
functions, so that the rule is:
1. What is created in enumerate_perpetual() should be started in
coldplug()
2. What is created in enumerate() should be started in catchup().
This reworks how device units are "powered on".
This makes sure that any device changes that might have happened while
we were restarting/reloading will be noticed properly. For that we'll
now properly serialize/deserialize both the device unit state and the
device "found" flags, and restore these initially in the "coldplug"
phase of the manager deserialization. While enumerating the udev devices
during startup we'll put together a new "found" flags mask, which we'll
the switch to in the "catchup" phase of the manager deserialization,
which follows the "coldplug" phase.
Note that during the "coldplug" phase no unit state change events are
generated, which is different for the "catchall" phase which will do
that. Thus we correctly make sure that the deserialized state won't pull
in new deps, but any device's change while we were reloading would.
Fixes: #8832
Replaces: #8675
This is very similar to the existing unit method coldplug() but is
called a bit later. The idea is that that coldplug() restores the unit
state from before any prior reload/restart, i.e. puts the deserialized
state in effect. The catchup() call is then called a bit later, to
catch up with the system state for which we missed notifications while
we were reloading. This is only really useful for mount, swap and device
mount points were we should be careful to generate all missing unit
state change events (i.e. call unit_notify() appropriately) for
everything that happened while we were reloading.
let's drop the "now" argument, it's exactly what MANAGER_IS_RUNNING()
returns, hence let's use that instead to simplify things.
Moreover, let's change the add/found argument pair to become found/mask,
which allows us to change multiple flags at the same time into opposing
directions, which will be useful later on.
Also, let's change the return type to void. It's a notifier call where
callers will ignore the return value anyway as it is nothing actionable.
Should not change behaviour.
It's a bit prettier that day as the function won't silently overwrite
any possibly pre-initialized field, and destroy it right before we
allocate a new event source.
Scope units don't have a main or control process we can watch, hence
let's explicitly watch the PIDs contained in them early on, just to make
things more robust and have at least something to watch.
This reworks how systemd tracks processes on cgroupv1 systems where
cgroup notification is not reliable. Previously, whenever we had reason
to believe that new processes showed up or got removed we'd scan the
cgroup of the scope or service unit for new processes, and would tidy up
the list of PIDs previously watched. This scanning is relatively slow,
and does not scale well. With this change behaviour is changed: instead
of scanning for new/removed processes right away we do this work in a
per-unit deferred event loop job. This event source is scheduled at a
very low priority, so that it is executed when we have time but does not
starve other event sources. This has two benefits: this expensive work is
coalesced, if events happen in quick succession, and we won't delay
SIGCHLD handling for too long.
This patch basically replaces all direct invocation of
unit_watch_all_pids() in scope.c and service.c with invocations of the
new unit_enqueue_rewatch_pids() call which just enqueues a request of
watching/tidying up the PID sets (with one exception: in
scope_enter_signal() and service_enter_signal() we'll still do
unit_watch_all_pids() synchronously first, since we really want to know
all processes we are about to kill so that we can track them properly.
Moreover, all direct invocations of unit_tidy_watch_pids() and
unit_synthesize_cgroup_empty_event() are removed too, when the
unit_enqueue_rewatch_pids() call is invoked, as the queued job will run
those operations too.
All of this is done on cgroupsv1 systems only, and is disabled on
cgroupsv2 systems as cgroup-empty notifications are reliable there, and
we do not need SIGCHLD events to track processes there.
Fixes: #9138
This way we don't need to repeat the argument twice.
I didn't replace all instances. I think it's better to leave out:
- asserts
- comparisons like x & y == x, which are mathematically equivalent, but
here we aren't checking if flags are set, but if the argument fits in the
flags.
Previously, we'd not care about failures that were seen earlier and
remain in "exited" state. This could be triggered if the main process of
a service failed while ExecStartPost= was still running, as in that case
we'd not immediately act on the main process failure because we needed
to wait for ExecStartPost= to finish, before acting on it.
Fixes: #8929
The function is similar to path_kill_slashes() but also removes
initial './', trailing '/.', and '/./' in the path.
When the second argument of path_simplify() is false, then it
behaves as the same as path_kill_slashes(). Hence, this also
replaces path_kill_slashes() with path_simplify().
This adds a flags parameter to unit_notify() which can be used to pass
additional notification information to the function. We the make the old
reload_failure boolean parameter one of these flags, and then add a new
flag that let's unit_notify() if we are configured to restart the
service.
Note that this adjusts behaviour of systemd to match what the docs say.
Fixes: #8398
This corresponds nicely with the specifiers we already pass for
/var/lib, /var/cache, /run and so on.
This is particular useful to update the test-path service files to
operate without guessable files, thus allowing multiple parallel
test-path invocations to pass without issues (the idea is to set $TMPDIR
early on in the test to some private directory, and then only use the
new %T or %V specifier to refer to it).
The symbolic links to private directories specified by StateDirectory=
or its friends are created on the host. So, when DynamicUser= and
RootDirectory=/RootImage= are set, then the executed process cannot
access private directory.
This makes the private directories are mounted on the non-private place
when both DynamicUser= and RootDirectory=/RootImage= are set.
Fixes#8965.
Most our other parsing functions do this, let's do this here too,
internally we accept that anyway. Also, the closely related
load_env_file() and load_env_file_pairs() also do this, so let's be
systematic.
It is not const, because a) systemd can bump it on its own if
errors occur, and b) the user can change it using signals.
Also it's not boolean.
$ busctl get-property org.freedesktop.systemd1 /org/freedesktop/systemd1 org.freedesktop.systemd1.Manager ShowStatus
b true
$ sudo kill -SIGRTMIN+21 1
$ busctl get-property org.freedesktop.systemd1 /org/freedesktop/systemd1 org.freedesktop.systemd1.Manager ShowStatus
b false
Fixes#4503.
When DynamicUser= is set, then RuntimeDirectory= should be always
chowned, as the service unit may enable RuntimeDirectoryPreserve=,
and the uid or gid may changed from the last run.
This also makes easier to migrate the service to use DynamicUser=.
This makes most header files easier to look at. Also Emacs gets really
slow when browsing through large sections of overly long prototypes,
which is much improved by this macro.
We should probably not do something similar with too many other cases,
as macros like this might help readability for some, but make it worse
for others. But I think given the complexity of this specific prototype
and how often we use it, it's worth doing.
When "systemctl daemon-reload" is run at the same time as "systemctl
start foo", the latter might hang. That's because commands like start
wait for JobRemoved signal to know when the job is finished. But if the
job is finished during reloading, the signal is never sent.
The hang can be easily reproduced by running
# for ((N=1; N>0; N++)) ; do echo $N ; systemctl daemon-reload ; done
# for ((N=1; N>0; N++)) ; do echo $N ; systemctl start systemd-coredump.socket ; done
in two different terminals. The start command will hang after 1-2
iterations.
This keeps track of jobs that were started before reload and finished
during it and sends JobRemoved after the reload has finished.
And port config_parse_exec_oom_score_adjust() over to use it.
While we are at it, let's also fix config_parse_exec_oom_score_adjust()
to accept an empty string for turning off OOM score adjustments set
earlier.
That way we can use it in nspawn.
Also, while we are at it, let's rename the call config_parse_rlimit(),
i.e. insert the "r", to clarify what kind of limit this is about.
Commenting out "WatchdogTimeout=3min" in systemd-logind.service causes
NotifyAccess to go from "main" to "none", breaking support for logind
restart. Let's fix that.
We'd write a sequence that was invalid unicode and this caused the d-bus
connection to be terminated:
$ busctl get-property org.freedesktop.systemd1 /org/freedesktop/systemd1/unit/dbus_2esocket org.freedesktop.systemd1.Unit SubState
s "running"
$ busctl get-property org.freedesktop.systemd1 /org/freedesktop/systemd1/unit/dbus_e2socket org.freedesktop.systemd1.Unit SubState
Remote peer disconnected
$ busctl get-property org.freedesktop.systemd1 /org/freedesktop/systemd1/unit/dbus_e2socket org.freedesktop.systemd1.Unit SubState
(hangs)
Fixes#8978.
When I see "test", I have to think three times what the return value
means. With "below" this is immediately clear. ratelimit_below(&limit)
sounds almost like English and is imho immediately obvious.
(I also considered ratelimit_ok, but this strongly implies that being under the
limit is somehow better. Most of the times this is true, but then we use the
ratelimit to detect triple-c-a-d, and "ok" doesn't fit so well there.)
C.f. a1bcaa07.
This adds a number of entries nspawn already applies to regular service
namespacing too. Most importantly let's mask /proc/kcore and
/proc/kallsyms too.
if we lack privs to create device nodes that's fine, and creating
/run/systemd/inaccessible/chr or /run/systemd/inaccessible/blk won't
work then. Document this in longer comments.
Fixes: #4484
Scope units are populated from PIDs specified by the bus client. We do
that when a scope is started. We really shouldn't allow scopes to be
started multiple times, as the PIDs then might be heavily out of date.
Moreover, clients should have the guarantee that any scope they allocate
has a clear runtime cycle which is not repetitive.
Previously we were a bit sloppy with the index and size types of arrays,
we'd regularly use unsigned. While I don't think this ever resulted in
real issues I think we should be more careful there and follow a
stricter regime: unless there's a strong reason not to use size_t for
array sizes and indexes, size_t it should be. Any allocations we do
ultimately will use size_t anyway, and converting forth and back between
unsigned and size_t will always be a source of problems.
Note that on 32bit machines "unsigned" and "size_t" are equivalent, and
on 64bit machines our arrays shouldn't grow that large anyway, and if
they do we have a problem, however that kind of overly large allocation
we have protections for usually, but for overflows we do not have that
so much, hence let's add it.
So yeah, it's a story of the current code being already "good enough",
but I think some extra type hygiene is better.
This patch tries to be comprehensive, but it probably isn't and I missed
a few cases. But I guess we can cover that later as we notice it. Among
smaller fixes, this changes:
1. strv_length()' return type becomes size_t
2. the unit file changes array size becomes size_t
3. DNS answer and query array sizes become size_t
Fixes: https://bugs.freedesktop.org/show_bug.cgi?id=76745
In particular, this confirms that the Found state needs to remain a bit-field:
$ systemd-analyze dump |grep 'Found: '|sort |uniq -c
105 Found: found-udev
3 Found: found-udev,found-mount
1 Found: found-udev,found-swap
(or in terms of the names of the actual files on disk, do not link
libsystemd-shared-238.a into libcore.a).
libsystemd_static is linked into libsystemd_shared, which in turn means that
anything that links to libcore and libsystemd_shared will get libsystemd_static
twice:
$ cc -o systemd 'systemd@exe/src_core_main.c.o' -Wl,--no-undefined -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -pie -DVALGRIND -Wl,--start-group src/core/libcore.a src/shared/libsystemd-shared-238.a src/shared/libsystemd-shared-238.so -pthread -lrt -lseccomp -lselinux -lmount -lblkid -Wl,--end-group -lseccomp -lpam -L/lib64 -laudit -lkmod -lmount -lrt -lcap -lacl -lcryptsetup -lgcrypt -lip4tc -lip6tc -lseccomp -lselinux -lidn -llzma -llz4 -lblkid '-Wl,-rpath,$ORIGIN/src/shared' -Wl,-rpath-link,/home/zbyszek/src/systemd/build/src/shared
This propagation of the dependency seems correct (in the sense that meson is
doing the expected thing based on the given configuration). Linking was done
this way in the original meson conversion. I was probably trying to get
everything to compile and link, I'm not sure why this particular choice was
made. In the meantime, meson has gotten better at propagating dependencies, so
it's possible that this had slightly different effect in the original
conversion, but I did not verify this. Either way, I think we should drop this.
With the patch:
$ cc -o systemd 'systemd@exe/src_core_main.c.o' -Wl,--no-undefined -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -pie -DVALGRIND -Wl,--start-group src/core/libcore.a src/shared/libsystemd-shared-238.so -pthread -lrt -lseccomp -lselinux -lmount -Wl,--end-group -lblkid -lrt -lseccomp -lpam -L/lib64 -laudit -lkmod -lselinux -lmount '-Wl,-rpath,$ORIGIN/src/shared' -Wl,-rpath-link,/home/zbyszek/src/systemd/build/src/shared
This is more correct because we're not linking the same code twice.
With the patch, libystemd_static is used in exactly four places:
- src/shared/libsystemd-shared-238.so
- src/udev/libudev.so.1.6.10
- pam_systemd.so
- test-bus-error
(compared to a bunch more executables before, including systemd,
systemd-analyze, test-hostname, test-ns, etc.)
Size savings are also noticable:
$ size /var/tmp/inst?/usr/lib/systemd/libsystemd-shared-238.so
text data bss dec hex filename
2397826 578488 15920 2992234 2da86a /var/tmp/inst1/usr/lib/systemd/libsystemd-shared-238.so
2397826 578488 15920 2992234 2da86a /var/tmp/inst2/usr/lib/systemd/libsystemd-shared-238.so
$ size /var/tmp/inst?/usr/lib/systemd/systemd
text data bss dec hex filename
1858790 261688 9320 2129798 207f86 /var/tmp/inst1/usr/lib/systemd/systemd
1556358 258704 8072 1823134 1bd19e /var/tmp/inst2/usr/lib/systemd/systemd
$ du -s /var/tmp/inst?
52216 /var/tmp/inst1
50844 /var/tmp/inst2
https://github.com/google/oss-fuzz/issues/1330#issuecomment-384054530 might be related.
This is useful for foreign container runtimes implementing the OCI
runtime spec, which only wants to deal with cgroup paths. There's
already an API to translate units into cgroup paths, with this we add
the reverse.
This drops a good number of type-specific _cleanup_ macros, and patches
all users to just use the generic ones.
In most recent code we abstained from defining type-specific macros, and
this basically removes all those added already, with the exception of
the really low-level ones.
Having explicit macros for this is not too useful, as the expression
without the extra macro is generally just 2ch wider. We should generally
emphesize generic code, unless there are really good reasons for
specific code, hence let's follow this in this case too.
Note that _cleanup_free_ and similar really low-level, libc'ish, Linux
API'ish macros continue to be defined, only the really high-level OO
ones are dropped. From now on this should really be the rule: for really
low-level stuff, such as memory allocation, fd handling and so one, go
ahead and define explicit per-type macros, but for high-level, specific
program code, just use the generic _cleanup_() macro directly, in order
to keep things simple and as readable as possible for the uninitiated.
Note that before this patch some of the APIs (notable libudev ones) were
already used with the high-level macros at some places and with the
generic _cleanup_ macro at others. With this patch we hence unify on the
latter.
I'm not sure if I understand the original code. AFAICS, errno does not
have to be set at all in this callback.
ratelimit_test() returns positive if we are under limit. The code would only
log if the condition happened very often, which I assume is not inteded, and
this check was supposed to prevent too much logging.
Those are quite similar to %i/%I, but refer to the last dash-separated
component of the name prefix.
The new functionality of dash-dropins could largely supersede the template
functionality, so it would be tempting to overload %i/%I. But that would
not be backwards compatible. So let's add the two new letters instead.
Do not try to party initialize a device during deserialization if it's not
known by udev (anymore) and therefore hasn't been seen during device
enumeration.
The device unit in this case has not been initialized properly and setting it
in the "plugged" state can be confusing.
Actually this happens during every boots when PID switches to the new rootfs:
PID is reexecuted and enumerates devices but since udev is not running, the
list of enumerated devices is empty.
PID1 updates the state of device units upon 2 different events:
- when it processes an event sent by udev and in this case the device deps are
started if the device enters in the "plugged" state.
- when it enumerates all devices during its startup or when it is asked to
reload its configuration data but in this case the device deps (if any) are
not retroactively started.
When udev processes a new "add" kernel event, it first registers the new device
in its databases then sends an event to systemd.
If for any reason, systemd is asked to reload its configuration between the
previous 2 steps, it might see for the first time the new device while scanning
/sys for all devices. Only during a second step, udev will send the event for
the new device.
In this peculiar case the device deps wont be started (even though the device
is first seen by PID1).
Indeed when reloading its configurations, PID1 will put the device unit in the
"plugged" state but without starting the device deps. Thereafter PID1 will get
the event from udev for the new device but the device unit will be in "plugged"
state already therefore it won't see any need to start the device dependencies.
Rather than assuming that during the reloading of systemd manager configuration
all devices listed in udev DBs have been already processed and should be put in
the "plugged" state (done by device_coldplug()), this patch does that only for
devices which have been processed via an udev event (device_dispatch_io())
previously. In this case we set "d->found" to "DEVICE_FOUND_UDEV" and we make
also sure to no more initialize "d->found" while enumerating devices. Instead
this field is now saved/restored while devices are serialized.
Double newlines (i.e. one empty lines) are great to structure code. But
let's avoid triple newlines (i.e. two empty lines), quadruple newlines,
quintuple newlines, …, that's just spurious whitespace.
It's an easy way to drop 121 lines of code, and keeps the coding style
of our sources a bit tigther.
We check the same condition at various places. Let's add a trivial,
common helper for this, and use it everywhere.
It's not going to make things much faster or much shorter, but I think a
lot more readable
Before this patch we'd resolve all symlinks of bind mounts and other
mount points to establish for a service in advance, and only then start
mounting them. This is problematic, if symlink chains jump around
between directories in a namespace tree, so that to resolve a specific
symlink chain we need to establish another mount already. A typical case
where this happens is if /etc/resolv.conf is a symlink to some file in
/run: in that case we'd normally resolve and mount /etc/resolv.conf
early on, but that's broken, as to do this properly we'd need to resolve
/etc/resolv.conf first, then figure out that /run needs to be mounted
before we can proceed, and thus reorder the order in which we apply
mounts dynamically.
With this change, whenever we are about to apply a mount, we'll do a
single step of the symlink normalization process, patch the mount entry
accordingly, and then sort the list of mounts to establish again, taking
the new path into account. This means that we can correctly deal with
the example above: we might start with wanting to mount /etc/resolv.conf
early, but after resolving it to the path in /run/ we'd push it to the
end of the list, ensuring that /run is mounted first.
(Note that this also fixes another bug: we were following symlinks on
the bind mount source relative to the root directory of the service,
rather than of the host. That's wrong though as we explicitly document
tha the source of bind mounts is always on the host.)
Absolute paths make everything simple and quick, but sometimes this requirement
can be annoying. A good example is calling 'test', which will be located in
/usr/bin/ or /bin depending on the distro. The need the provide the full path
makes it harder a portable unit file in such cases.
This patch uses a fixed search path (DEFAULT_PATH which was already used as the
default value of $PATH), and if a non-absolute file name is found, it is
immediately resolved to a full path using this search path when the unit is
loaded. After that, everything behaves as if an absolute path was specified. In
particular, the executable must exist when the unit is loaded.
Returning 0 on not-found/wrong-type is confusing. Let's return -ENXIO in that
case instead, and explicitly ignore it in the call site where we want to do that.
I think this is clearer and less likely to be used errenously in case another
call site is added.
C.f. 152c475f95 and 98b1d2b8d9.
In environments where CAP_MKNOD is not available or inside
user namespaces it is still desirable to enable services to use
PrivateDevices= . So fall back to using bind-mounts on EPERM.
Files which are installed as-is (any .service and other unit files, .conf
files, .policy files, etc), are left as is. My assumption is that SPDX
identifiers are not yet that well known, so it's better to retain the
extended header to avoid any doubt.
I also kept any copyright lines. We can probably remove them, but it'd nice to
obtain explicit acks from all involved authors before doing that.
sd_bus_open/sd_bus_open_system/sd_bus_open_user are convenient, but
don't allow the description to be set. After they return, the bus is
is already started, and sd_bus_set_description() fails with -EBUSY.
It would be possible to allow sd_bus_set_description() to update the
description "live", but messages are already emitted from sd_bus_open
functions, so it's better to allow the description to be set in
sd_bus_open/sd_bus_open_system/sd_bus_open_user.
Fixes message like:
Bus n/a: changing state UNSET → OPENING
The initial fix for relabelling the cgroup filesystem for
SELinux delivered in commit 8739f23e3 was based on the assumption that
the cgroup filesystem is already populated once mount_setup() is
executed, which was true for my system. What I wasn't aware is that this
is the case only when another instance of systemd was running before
this one, which can happen if systemd is used in the initrd (for ex. by
dracut).
In case of a clean systemd start-up the cgroup filesystem is actually
being populated after mount_setup() and does not need relabelling as at
that moment the SELinux policy is already loaded. Since however the root
cgroup filesystem was remounted read-only in the meantime this operation
will now fail.
To fix this check for the filesystem mount flags before relabelling and
only remount ro->rw->ro if necessary and leave the filesystem read-write
otherwise.
Fixes#7901.
Re-use the hacks used to link user keyring, when creating the session
keyring. This way changing ownership of the keyring is not required, and thus
incovation_id can be correctly created in restricted environments.
Creating invocation_id with root permissions works and linking it into session
keyring works, as at that point session keyring is possessed.
Simple way to validate this is with following commands:
$ journalctl -f &
$ sudo systemd-run --uid 1000 /bin/sh -c 'keyctl describe @s; keyctl list @s; keyctl read `keyctl search @s user invocation_id`'
which now works in LXD containers as well as on the host.
Fixes: https://github.com/systemd/systemd/issues/7655
This reworks the SELinux and SMACK label fixing calls in a number of
ways:
1. The two separate boolean arguments of these functions are converted
into a flags type LabelFixFlags.
2. The operations are now implemented based on O_PATH. This should
resolve TTOCTTOU races between determining the label for the file
system object and applying it, as it it allows to pin the object
while we are operating on it.
3. When changing a label fails we'll query the label previously set, and
if matches what we want to set anyway we'll suppress the error.
Also, all calls to label_fix() are now (void)ified, when we ignore the
return values.
Fixes: #8566
linux/fs.h sys/mount.h, libmount.h and missing.h all include MS_*
definitions.
To avoid problems, only one of linux/fs.h, sys/mount.h and libmount.h
should be included. And missing.h must be included last.
Without this, building systemd may fail with:
In file included from [...]/libmount/libmount.h:31:0,
from ../systemd-238/src/core/manager.h:23,
from ../systemd-238/src/core/emergency-action.h:37,
from ../systemd-238/src/core/unit.h:34,
from ../systemd-238/src/core/dbus-timer.h:25,
from ../systemd-238/src/core/timer.c:26:
[...]/sys/mount.h:57:2: error: expected identifier before numeric constant
Currently we add target dependencies while we are loading units. This
can create ordering loops even if configuration doesn't contain any
loop. Take for example following configuration,
$ systemctl get-default
multi-user.target
$ cat /etc/systemd/system/test.service
[Unit]
After=default.target
[Service]
ExecStart=/bin/true
[Install]
WantedBy=multi-user.target
If we encounter such unit file early during manager start-up (e.g. load
queue is dispatched while enumerating devices due to SYSTEMD_WANTS in
udev rules) we would add stub unit default.target and we order it Before
test.service. At the same time we add implicit Before to
multi-user.target. Later we merge two units and we create ordering cycle
in the process.
To fix the issue we will now never add any target dependencies until we
loaded all the unit files and resolved all the aliases.
When we are attempting to create directory somewhere in the bowels of /var/lib
and get an error that it already exists, it can be quite hard to diagnose what
is wrong (especially for a user who is not aware that the directory must have
the specified owner, and permissions not looser than what was requested). Let's
print a warning in most cases. A warning is appropriate, because such state is
usually a sign of borked installation and needs to be resolved by the adminstrator.
$ build/test-fs-util
Path "/tmp/test-readlink_and_make_absolute" already exists and is not a directory, refusing.
(or)
Directory "/tmp/test-readlink_and_make_absolute" already exists, but has mode 0775 that is too permissive (0755 was requested), refusing.
(or)
Directory "/tmp/test-readlink_and_make_absolute" already exists, but is owned by 1001:1000 (1000:1000 was requested), refusing.
Assertion 'mkdir_safe(tempdir, 0755, getuid(), getgid(), MKDIR_WARN_MODE) >= 0' failed at ../src/test/test-fs-util.c:320, function test_readlink_and_make_absolute(). Aborting.
No functional change except for the new log lines.
This is similar to TAKE_PTR() but operates on file descriptors, and thus
assigns -1 to the fd parameter after returning it.
Removes 60 lines from our codebase. Pretty good too I think.
This macro will read a pointer of any type, return it, and set the
pointer to NULL. This is useful as an explicit concept of passing
ownership of a memory area between pointers.
This takes inspiration from Rust:
https://doc.rust-lang.org/std/option/enum.Option.html#method.take
and was suggested by Alan Jenkins (@sourcejedi).
It drops ~160 lines of code from our codebase, which makes me like it.
Also, I think it clarifies passing of ownership, and thus helps
readability a bit (at least for the initiated who know the new macro)
This rearranges chase_symlinks() a bit: if no special flags are
specified it will now revert to behaviour before
b12d25a8d6. However, if the new
CHASE_TRAIL_SLASH flag is specified it will follow the behaviour
introduced by that commit.
I wasn't sure which one to make the beaviour that requires specification
of a flag to enable. I opted to make the "append trailing slash"
behaviour the one to enable by a flag, following the thinking that the
function should primarily be used to generate a normalized path, and I
am pretty sure a path without trailing slash is the more "normalized"
one, as the trailing slash is not really a part of it, but merely a
"decorator" that tells various system calls to generate ENOTDIR if the
path doesn't refer to a path.
Or to say this differently: if the slash was part of normalization then
we really should add it in all cases when the final path is a directory,
not just when the user originally specified it.
Fixes: #8544
Replaces: #8545
The warning is not emitted for absolute paths like /dev/sda or /home, which are
converted to .device and .mount unit names without any fuss.
Most of the time it's unlikely that users use invalid unit names on purpose,
so let's warn them. Warnings are silenced when --quiet is used.
$ build/systemctl show -p Id hello@foo-bar/baz
Invalid unit name "hello@foo-bar/baz" was escaped as "hello@foo-bar-baz" (maybe you should use systemd-escape?)
Id=hello@foo-bar-baz.service
$ build/systemd-run --user --slice foo-bar/baz --unit foo-bar/foo true
Invalid unit name "foo-bar/foo" was escaped as "foo-bar-foo" (maybe you should use systemd-escape?)
Invalid unit name "foo-bar/baz" was escaped as "foo-bar-baz" (maybe you should use systemd-escape?)
Running as unit: foo-bar-foo.service
Fixes#8302.
Let's better check this inside of the call than before it, so that we
never issue this while reloading, even should these calls be called due
to other reasons than just the unit notify.
This makes sure the reload state is unset a bit earlier in
manager_reload() so that we can safely call this function from there and
they do the right thing.
Follow-up for e63ebf71ed.
No need to go through the specifier_printf() if the path is already too long in
the unexpanded form (since specifiers increase the length of the string in all
practical cases).
In the oss-fuzz test case, valgrind reports:
total heap usage: 179,044 allocs, 179,044 frees, 72,687,755,703 bytes allocated
and the original config file is ~500kb. This isn't really a security issue,
since the config file has to be trusted any way, but just a matter of
preventing accidental resource exhaustion.
https://oss-fuzz.com/v2/issue/4651449704251392/6977
While at it, fix order of arguments in the neighbouring log_syntax() call.