This tries to address the "bind"/"unbind" uevent kernel API breakage, by
changing the semantics of device tags.
Previously, tags would be applied on uevents (and the database entries
they result in) only depending on the immediate context. This means that
if one uevent causes the tag to be set and the next to be unset, this
would immediately effect what apps would see and the database entries
would contain each time. This is problematic however, as tags are a
filtering concept, and if tags vanish then clients won't hence notice
when a device stops being relevant to them since not only the tags
disappear but immediately also the uevents for it are filtered including
the one necessary for the app to notice that the device lost its tag and
hence relevance.
With this change tags become "sticky". If a tag is applied is once
applied to a device it will stay in place forever, until the device is
removed. Tags can never be removed again. This means that an app
watching a specific set of devices by filtering for a tag is guaranteed
to not only see the events where the tag is set but also all follow-up
events where the tags might be removed again.
This change of behaviour is unfortunate, but is required due to the
kernel introducing new "bind" and "unbind" uevents that generally have
the effect that tags and properties disappear and apps hence don't
notice when a device looses relevance to it. "bind"/"unbind" events were
introduced in kernel 4.12, and are now used in more and more subsystems.
The introduction broke userspace widely, and this commit is an attempt
to provide a way for apps to deal with it.
While tags are now "sticky" a new automatic device property
CURRENT_TAGS is introduced (matching the existing TAGS property) that
always reflects the precise set of tags applied on the most recent
events. Thus, when subscribing to devices through tags, all devices that
ever had the tag put on them will be be seen, and by CURRENT_TAGS it may
be checked whether the device right at the moment matches the tag
requirements.
See: #7587#7018#8221
When invoking "ldd" to find dependency libraries we already set
$LD_LIBRARY_PATH to point to our own build tree, so that our libraries
are checked, not the host libraries. This is not sufficient howeever, as
libudev is built in a subdir. Add that, too.
Let's make sure the sd_listen_fd() docs are really found from the
.socket file documentation as well as the FileDescriptorStoreMax=
documentation.
Let's also emphasize that that's where the order in which the fds are
passed are documented.
Fixes: #16647
Desired functionality:
Set securebits for services started as non-root user.
Failure:
The starting of the service fails if no ambient capability shall be
raised.
... systemd[217941]: ...: Failed to set process secure bits: Operation
not permitted
... systemd[217941]: ...: Failed at step SECUREBITS spawning
/usr/bin/abc.service: Operation not permitted
... systemd[1]: abc.service: Failed with result 'exit-code'.
Reason:
For setting securebits the capability CAP_SETPCAP is required. However
the securebits (if no ambient capability shall be raised) are set after
setresuid.
When setresuid is invoked all capabilities are dropped from the
permitted, effective and ambient capability set. If the securebit
SECBIT_KEEP_CAPS is set the permitted capability set is retained, but
the effective and the ambient set are cleared.
If ambient capabilities shall be set, the securebit SECBIT_KEEP_CAPS is
added to the securebits configured in the service file and set together
with the securebits from the service file before setresuid is executed
(in enforce_user).
Before setresuid is executed the capabilities are the same as for pid1.
This means that all capabilities in the effective, permitted and
bounding set are set. Thus the capability CAP_SETPCAP is in the
effective set and the prctl(PR_SET_SECUREBITS, ...) succeeds.
However, if the secure bits aren't set before setresuid is invoked they
shall be set shortly after the uid change in enforce_user.
This fails as SECBIT_KEEP_CAPS wasn't set before setresuid and in
consequence the effective and permitted set was cleared, hence
CAP_SETPCAP is not set in the effective set (and cannot be raised any
longer) and prctl(PR_SET_SECUREBITS, ...) failes with EPERM.
Proposed solution:
The proposed solution consists of three parts
1. Check in enforce_user, if securebits are configured in the service
file. If securebits are configured, set SECBIT_KEEP_CAPS
before invoking setresuid.
2. Don't set any other securebits than SECBIT_KEEP_CAPS in enforce_user,
but set all requested ones after enforce_user.
This has the advantage that securebits are set at the same place for
root and non-root services.
3. Raise CAP_SETPCAP to the effective set (if not already set) before
setting the securebits to avoid EPERM during the prctl syscall.
For gaining CAP_SETPCAP the function capability_bounding_set_drop is
splitted into two functions:
- The first one raises CAP_SETPCAP (required for dropping bounding
capabilities)
- The second drops the bounding capabilities
Why are ambient capabilities not affected by this change?
Ambient capabilities get cleared during setresuid, no matter if
SECBIT_KEEP_CAPS is set or not.
For raising ambient capabilities for a user different to root, the
requested capability has to be raised in the inheritable set first. Then
the SECBIT_KEEP_CAPS securebit needs to be set before setresuid is
invoked. Afterwards the ambient capability can be raised, because it is
in the inheritable and permitted set.
Security considerations:
Although the manpage is ambiguous SECBIT_KEEP_CAPS is cleared during
execve no matter if SECBIT_KEEP_CAPS_LOCKED is set or not. If both are
set only SECBIT_KEEP_CAPS_LOCKED is set after execve.
Setting SECBIT_KEEP_CAPS in enforce_user for being able to set
securebits is no security risk, as the effective and permitted set are
set to the value of the ambient set during execve (if the executed file
has no file capabilities. For details check man 7 capabilities).
Remark:
In capability-util.c is a comment complaining about the missing
capability CAP_SETPCAP in the effective set, after the kernel executed
/sbin/init. Thus it is checked there if this capability has to be raised
in the effective set before dropping capabilities from the bounding set.
If this were true all the time, ambient capabilities couldn't be set
without dropping at least one capability from the bounding set, as the
capability CAP_SETPCAP would miss and setting SECBIT_KEEP_CAPS would
fail with EPERM.
Up to now the capability CAP_SETPCAP was raised implicitly in the
function capability_bounding_set_drop.
This functionality is moved into a new function
(capability_gain_cap_setpcap).
The new function optionally provides the capability set as it was
before raisining CAP_SETPCAP.
Instead of assuming that more-recently modified directories have higher mtime,
just look for any mtime changes, up or down. Since we don't want to remember
individual mtimes, hash them to obtain a single value.
This should help us behave properly in the case when the time jumps backwards
during boot: various files might have mtimes that in the future, but we won't
care. This fixes the following scenario:
We have /etc/systemd/system with T1. T1 is initially far in the past.
We have /run/systemd/generator with time T2.
The time is adjusted backwards, so T2 will be always in the future for a while.
Now the user writes new files to /etc/systemd/system, and T1 is updated to T1'.
Nevertheless, T1 < T1' << T2.
We would consider our cache to be up-to-date, falsely.
This check was added in d904afc730. It would only
apply in the case where the cache hasn't been loaded yet. I think we pretty
much always have the cache loaded when we reach this point, but even if we
didn't, it seems better to try to reload the unit. So let's drop this check.
We really only care if the cache has been reloaded between the time when we
last attempted to load this unit and now. So instead of recording the actual
time we try to load the unit, just store the timestamp of the cache. This has
the advantage that we'll notice if the cache mtime jumps forward or backward.
Also rename fragment_loadtime to fragment_not_found_time. It only gets set when
we failed to load the unit and the old name was suggesting it is always set.
In https://bugzilla.redhat.com/show_bug.cgi?id=1871327
(and most likely https://bugzilla.redhat.com/show_bug.cgi?id=1867930
and most likely https://bugzilla.redhat.com/show_bug.cgi?id=1872068) we try
to load a non-existent unit over and over from transaction_add_job_and_dependencies().
My understanding is that the clock was in the future during inital boot,
so cache_mtime is always in the future (since we don't touch the fs after initial boot),
so no matter how many times we try to load the unit and set
fragment_loadtime / fragment_not_found_time, it is always higher than cache_mtime,
so manager_unit_cache_should_retry_load() always returns true.
The name is misleading, since we aren't really loading the unit from cache — if
this function returns true, we'll try to load the unit from disk, updating the
cache in the process.
This seems to be overridable by setting the SYSTEMD_HOMEWORK_PATH env
variable, but the error message always printed the SYSTEMD_HOMEWORK_PATH
constant.
N_DEVICE_NODE_LIST_ATTEMPTS is unconditionally used since version 246 and
ac1f3ad05f
However, this variable is only defined if HAVE_BLKID is set resulting in
the following build failure if cryptsetup is enabled but not libblkid:
../src/shared/dissect-image.c:1336:34: error: 'N_DEVICE_NODE_LIST_ATTEMPTS' undeclared (first use in this function)
1336 | for (unsigned i = 0; i < N_DEVICE_NODE_LIST_ATTEMPTS; i++) {
|
Fixes:
- http://autobuild.buildroot.org/results/67782c225c08387c1bbcbea9eee3ca12bc6577cd
We must have the error number around when completing the transaction.
Let's hence make sure we always initialize it *first* (we accidentally
did it once after).
Fixes: #11626
With the changes from 2c0dffe82d, starting
systemd-networkd.service will also activate systemd-networkd.socket.
When tearing down a test, we need to stop the socket as well, to make
sure networkd can't be activated accidentally with the wrong
configuration.
On systems without an RTC, systemd currently sets the clock to a
compile-time epoch value, derived from the NEWS file in the
repository. This is not ideal as the initial clock hence depends
on the last time systemd was built, not when the image was compiled.
Let's provide a different way here and look at `/usr/lib/clock-epoch`.
If that file exists, it's timestamp for the last modification will be
used instead of the compile-time default.
Let's document the discrepancy between the Sec and USec suffixing of
unit files and D-Bus properties at three places: in "systemctl show"
(where it already was briefly mentioned), in the D-Bus interface
description (at one place at least, i.e. the most prominent of
properties that encapsulate time values, there are many more) and in the
general man page explaining time values.
By documenting this at all three places I think we now do as much as we
can do about this highlighting the discrepancy of the naming and the
reasons behind it.
Fixes: #2047
We need to include `<sys/stat.h>` for usage of the `struct stat` in
the Manager struct, much as we already include `<stdbool.h>` for C99
booleans.
This helps alleviate another minor build failure on non-glibc systems.
This allows us to properly detect mount points, for free. (Also, allows
us to respect btimes that are newer than the cutoff, which should be
useful when people untar file trees in /var/tmp)
Fixes: #16848
This file must be included on non-glibc systems to ensure
the `LOCK_EX` definition is available.
Signed-off-by: Ikey Doherty <ikey.doherty@lispysnake.com>
Any uevent other then the initial and the last uevent we see for a
device (which is "add" and "remove") should result in a reload being
triggered, including "bind" and "unbind". Hence, let's fix up the check.
("move" is kinda a combined "remove" + "add", hence cover that too)
As discussed in https://gitlab.freedesktop.org/libinput/libinput/-/issues/521, it adds a narrower
match that only applies to X240. Other laptops that match `pvrThinkPad??40` are not affected:
$ systemd-hwdb query 'evdev:name:SynPS/2 Synaptics TouchPad:dmi:*svnLENOVO*:pvrThinkPadX240:*'
EVDEV_ABS_00=1232:5711:51
EVDEV_ABS_01=1159:4700:53
EVDEV_ABS_35=1232:5711:51
EVDEV_ABS_36=1159:4700:53
$ systemd-hwdb query 'evdev:name:SynPS/2 Synaptics TouchPad:dmi:*svnLENOVO*:pvrThinkPadX140:*'
EVDEV_ABS_00=::41
EVDEV_ABS_01=::37
EVDEV_ABS_35=::41
EVDEV_ABS_36=::37