The usage in unit_get_own_mask is redundant, we only need apply
disable_mask at the end befor application, i.e. calculating enable or
target mask.
(IOW, we allow all configurations, but disabling affects effective
controls.)
Modify tests accordingly and add testing of enable mask.
This is intended as cleanup, with no effect but changing unit_dump
output.
The unit_add_siblings_to_cgroup_realize_queue does more than mere
siblings queueing, hence define a family of a unit as (immediate)
children of the unit and immediate children of all ancestors.
Working with this abstraction simplifies the queuing calls and it
shouldn't change the functionality.
Merge members mask invalidation into
unit_add_siblings_to_cgroup_realize_queue, this way unit_realize_cgroup
needn't be called with members mask invalidation.
We have to retain the members mask invalidation in unit_load -- although
active units would have cgroups (re)realized (unit_load queues for
realization), the realization would happen with potentially stale mask.
unit_free(u) realizes direct parent and invalidates members mask of all
ancestors. This isn't sufficient in v1 controller hierarchies since
siblings of the freed unit may have existed only because of the removed
unit.
We cannot be lazy about the siblings because if parent(u) is also
removed, it'd migrate and rmdir cgroups for siblings(u). However,
realized masks of siblings(u) won't reflect this change.
This was a non-issue earlier, because we weren't removing cgroup
directories properly (effectively matching the stale realized mask),
removal failed because of tasks left by missing migration (see previous
commit).
Therefore, ensure realization of all units necessary to clean up after
the free'd unit.
Fixes: #14149
When we are about to derealize a controller on v1 cgroup, we first
attempt to delete the controller cgroup and migrate afterwards. This
doesn't work in practice because populated cgroup cannot be deleted.
Furthermore, we leave out slices from migration completely, so
(un)setting a control value on them won't realize their controller
cgroup.
Rework actual realization, unit_create_cgroup() becomes
unit_update_cgroup() and make sure that controller hierarchies are
reduced when given controller cgroup ceased to be needed.
Note that with this we introduce slight deviation between v1 and v2 code
-- when a descendant unit turns off a delegated controller, we attempt
to disable it in ancestor slices. On v2 this may fail (kernel enforced,
because of child cgroups using the controller), on v1 we'll migrate
whole subtree and trim the subhierachy. (Previously, we wouldn't take
away delegated controller, however, derealization was broken anyway.)
Fixes: #14149
When available, enable memory_recursiveprot. Realistically it always
makes sense to delegate MemoryLow= and MemoryMin= to all children of a
slice/unit.
The kernel option is not enabled by default as it might cause
regressions in some setups. However, it is the better default in
general, and it results in a more flexible and obvious behaviour.
The alternative to using this option would be for user's to also set
DefaultMemoryLow= on slices when assigning MemoryLow=. However, this
makes the effect of MemoryLow= on some children less obvious, as it
could result in a lower protection rather than increasing it.
From the kernel documentation:
memory_recursiveprot
Recursively apply memory.min and memory.low protection to
entire subtrees, without requiring explicit downward
propagation into leaf cgroups. This allows protecting entire
subtrees from one another, while retaining free competition
within those subtrees. This should have been the default
behavior but is a mount-option to avoid regressing setups
relying on the original semantics (e.g. specifying bogusly
high 'bypass' protection values at higher tree levels).
This was added in kernel commit 8a931f801340c (mm: memcontrol:
recursive memory.low protection), which became available in 5.7 and was
subsequently fixed in kernel 5.7.7 (mm: memcontrol: handle div0 crash
race condition in memory.low).
It is possible that we will be running with an upgraded libseccomp, in which
case libseccomp might know the syscall name, even if the number is not known at
the time when systemd is being compiled. The guard only serves to break such
upgrades, by requiring that we also recompile systemd.
For s390-specific syscalls, use a define to exclude them, so that that we don't
try to filter them on other arches.
(cherry picked from commit 6cf852e79eb0eced2f77653941f9c75c3bd79386)
The kernel finally has a proper API to determine the mnt_id of a file.
Let's use it.
This adds support for the STATX_MNT_ID field of statx(), added in
kernel 5.8.
Also, let's move the glue for this to src/shared/ so that we later can
reuse this in sysemd-firstboot.
Given that libpwquality is a more a leaf dependency, let's make it
runtime optional, so that downstream distros can downgrade their package
deps from Required to Recommended.
The cryptsetup context pins the loop device even after deactivation.
Let's explicitly release the context to make sure the subsequent
loopback device detaching works cleanly.
This how this works on Linux: when atomically creating a file we need to
fully populate it under a temporary name and then when we are fully
done, sync it and the directory it is contained in, before renaming it
to the final name.
This was needed before commit 16e4fd87c5 added a
mode that opens the log fds for every single log message. This mode is used in
execute.c since then making the explicit call to log_open unnecessary.
This basically reverts ea89a119cd.
glibc 2.26 lifted restrictions on search domains count or length to
unlimited. This has also been backported to 2.17 in some distributions (RHEL 7
and derivatives). Other softwares may have their own limits for search domains,
but we should not restrict what is written out any more.
https://sourceware.org/legacy-ml/libc-announce/2017/msg00001.html
This comes up occasionally with new users. The phrase "Logs begin ..." is
ambiguous because it can be taken to mean the logs being displayed or all logs
(the intended meaning). Let's rephrase this as "Journal begins ..." to make
this clearer.
This information was already available in the debug output, but I think it
is good to include it in the message in the table. This makes it easier to wrap
one's head around the allowlist/denylist filtering.
Unprivileged test-fs-util fails on my system since /sys/dev/block is
inaccessible for unprivileged users, so let's skip encrypted path test if we
get EACCES or similar.
If we don't succeed on the first try it's because another process is
opening the same device. Do a microsleep for 2ms to increase the
chances it has completed the next time around the loop.
The symlink /dev/mapper/dm_name is created by udev after a mapper
device is set up. So libdevmapper/libcrypsetup might tell us that
a verity device exists, but the symlink we use as the source for
the mount operation might not be there yet.
Instead of falling back to a new unique device set up, wait for
the udev event matching on the expected devlink for at least 100ms
(after which the benefits of sharing a device in terms of setup
time start to disappear - on my production machines, opening a new
verity device seems to take between 150ms and 300ms)
This effectively makes little difference because we exit soon later
anyway, which will close the fds, too. However, it's still useful since
it means the parent will get EOF events on them in the order we process
things and isn't delayed to process the data from the pipes until the
child dies.
Let's use a proper table for outputting partition information. Let's
also put the general information about the image first, and the table
after that.
Moreover, dissect the image before showing any output, so that we can
early on return an error if the image is not valid.
That way we can turn off kernel partition scanning if verity data is
available (as we don't support verity for full GPT images, only for
simple file system images).
LOOP_CONFIGURE allows us to configure a loopback device in one ioctl
instead of two, which is not just faster but also removes the race that
udev might start probing the device before we adjusted things properly.
Unfortunately LOOP_CONFIGURE is broken in regards to LO_FLAGS_PARTSCAN
as of kernel 5.8.0. This patch contains a work-around for that, to
fallback to old behaviour if partition scanning is requested but does
not work. Sucks a bit.
Proposed upstream fix for that issue:
https://lkml.org/lkml/2020/8/6/97
The machine-id is used to create a few directories and setup a default
loader entry in loader.conf. Having bootctl create the directories
itself is not particularly useful as it does not put anything in them
and bootctl install is not guaranteed to be called before an initramfs
tool like kernel-install so other programs will always need to have
logic to create the directories themselves if they happen to be called
before bootctl install is called.
On top of this, when using unified kernel images, these are installed to
$BOOT/EFI/Linux which removes the need to have the directories created
by bootctl at all. This further indicates that these directories should
be created by the program that puts something in them rather than by
bootctl.
Removing the machine-id dependency allows bootctl install to be called
even when there's no machine-id in the image. This is useful for image
builders such as mkosi which don't have a machine-id when
installing systemd-boot (via bootctl) because it should only be
generated by systemd when the final image is booted.
The default entry in loader.conf based on the machine-id in loader.conf
is also removed which shouldn't be a massive loss in usability overall.
This commit reverts commit 341890d.
Let's fix up invalid GECOS fields both when we convert from NSS to JSON
and the other way round.
Kinda sucks we have to do that, but NSS does it when writing data to
/etc/passwd, so let's do the same.
Fixes: #16668
User records have the realname/gecos fields, groups never had that, but
it would really be useful to have it, hence let's add it with similar
semantics.
We enforce the same syntax as for GECOS, since it's better to start with
strict rules and losen them later instead of the opposite.
The commit 1070d271fa which was supposed
too fix this does not seem to take effect any more. We get again 34%
compilation success rate while scanning systemd itself. Moreover, the
installed header file breaks compilation of programs that include it:
"/usr/include/systemd/_sd-common.h", line 23: error #35: #error directive: "Do
not include _sd-common.h directly; it is a private header."
# error "Do not include _sd-common.h directly; it is a private header."
^
Follows the same pattern and features as RootImage, but allows an
arbitrary mount point under / to be specified by the user, and
multiple values - like BindPaths.
Original implementation by @topimiettinen at:
https://github.com/systemd/systemd/pull/14451
Reworked to use dissect's logic instead of bare libmount() calls
and other review comments.
Thanks Topi for the initial work to come up with and implement
this useful feature.
Given a string in the format 'one:two three four:five', returns a string
vector with each word. If the second element of the tuple is not
present, an empty string is returned in its place, so that the vector
can be processed in pairs.
[zjs: use EXTRACT_UNESCAPE_SEPARATORS instead of EXTRACT_CUNESCAPE_RELAX.
This way we do escaping exactly once and in normal strict mode.]
This code was changed in this pull request:
https://github.com/systemd/systemd/pull/16571
After some discussion and more investigation, we better understand
what's going on. So, update the comment, so things are more clear
to future readers.
It is OK for some symbols to be missing. With this change, "test-nss sss" can
be used to test nss-sss without crashing.
$ build-rawhide/test-nss sss fedoraproject.org
======== sss ========
_nss_sss_gethostbyname4_r not defined
_nss_sss_gethostbyname3_r not defined
_nss_sss_gethostbyname3_r not defined
_nss_sss_gethostbyname3_r not defined
_nss_sss_gethostbyname3_r not defined
_nss_sss_gethostbyname2_r("fedoraproject.org", AF_INET) → status=NSS_STATUS_NOTFOUND
errno=0/--- h_errno=-1/Resolver internal error
_nss_sss_gethostbyname2_r("fedoraproject.org", AF_INET6) → status=NSS_STATUS_NOTFOUND
errno=0/--- h_errno=-1/Resolver internal error
_nss_sss_gethostbyname2_r("fedoraproject.org", *) → status=NSS_STATUS_UNAVAIL
errno=97/EAFNOSUPPORT h_errno=-1/Resolver internal error
_nss_sss_gethostbyname2_r("fedoraproject.org", AF_UNIX) → status=NSS_STATUS_UNAVAIL
errno=97/EAFNOSUPPORT h_errno=-1/Resolver internal error
_nss_sss_gethostbyname_r("fedoraproject.org") → status=NSS_STATUS_NOTFOUND
errno=0/--- h_errno=-1/Resolver internal error
We talked about the verification key, then about sealing keys, and then
about the verification key again. Let's shorten things a bit, and divide
the output in three paragraphs: one about the machine, one about the sealing
keys, and one about verification keys and the qr code with them.
From a report in https://bugzilla.redhat.com/show_bug.cgi?id=1861463:
usb-gadget.target: Failed to load configuration: No such file or directory
usb-gadget.target: Failed to load configuration: No such file or directory
usb-gadget.target: Trying to enqueue job usb-gadget.target/start/fail
usb-gadget.target: Failed to load configuration: No such file or directory
Assertion '!bus_error_is_dirty(e)' failed at src/libsystemd/sd-bus/bus-error.c:239, function bus_error_setfv(). Ignoring.
sys-devices-platform-soc-2100000.bus-2184000.usb-ci_hdrc.0-udc-ci_hdrc.0.device: Failed to enqueue SYSTEMD_WANTS= job, ignoring: Unit usb-gadget.target not found.
I *think* this is the place where the reuse occurs: we call
bus_unit_validate_load_state(unit, e) twice in a row.
This allows more granular access control in PolicyKit rules, similar to
/etc/sudoers, for polkit actions:
* org.freedesktop.machine1.host-shell
* org.freedesktop.machine1.shell
Example configuration, place in /etc/polkit-1/rules.d/
polkit.addRule(function(action, subject) {
if (action.id == "org.freedesktop.machine1.host-shell"
&& subject.user == "my-user"
&& action.lookup("user") == "target-user") {
return polkit.Result.YES;
}
});
I happen to have a machine where /boot is not a separate mountpoint,
but rather just a directory under /. After upgrade to recent Fedora,
I found out that grub2 can't find any new kernels.
This happens because loadentry script generates kernel and initrd file
paths relative to /boot, while grub2 expects path to be relative to the
root of filesystem on which they are residing.
This commit fixes this issue by using stat's %m to find the mount point
of a partition holding the images, and using it as a prefix to be
removed from ENTRY_DIR_ABS.
Note that %m for stat requires coreutils 8.6, released in Oct 2010.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Apparently, if IO is still in flight at the moment we invoke LOOP_CLR_FD
it is likely simply dropped (probably because yanking physical storage,
such as a USB stick would drop it too). Let's protect ourselves against
that and always sync explicitly before we invoke it.
Fixed below systemd codesonar warning.
isprint() is invoked here with an argument of signed
type char, but only has defined behavior for int arguments that are
either representable as unsigned char or equal to the value
of macro EOF(-1).
As per codesonar report, in a number of libc implementations, isprint()
function implemented using lookup tables (arrays): passing in a
negative value can result in a read underrun.
The explicit limit is dropped, which means that we return to the kernel default
of 50% of RAM. See 362a55fc14 for a discussion why that is not as much as it
seems. It turns out various applications need more space in /dev/shm and we
would break them by imposing a low limit.
While at it, rename the define and use a single macro for various tmpfs mounts.
We don't really care what the purpose of the given tmpfs is, so it seems
reasonable to use a single macro.
This effectively reverts part of 7d85383edb. Fixes#16617.
The new retry intervals are [15, 20, 26, 34, 45, 60, 80, 106, 141, 188, 250,
333, 360, ...]. This should allow graceful response if a transient network
failure is encountered. Growth is exponential, but with a small power and
capped to a non-too-large value so that we resynchronize within a few minutes
after network is restored. I made the minimum 15 s to make sure that we never
send packets more often than that.
Fixes#16492.
Allows to specify mount options for RootImage.
In case of multi-partition images, the partition number can be prefixed
followed by colon. Eg:
RootImageOptions=1:ro,dev 2:nosuid nodev
In absence of a partition number, 0 is assumed.
This should be enough to fix https://bugzilla.redhat.com/show_bug.cgi?id=1856514.
But the limit should be significantly higher than 10% anyway. By setting a
limit on /tmp at 10% we'll break many reasonable use cases, even though the
machine would deal fine with a much larger fraction devoted to /tmp.
(In the first version of this patch I made it 25% with the comment that
"Even 25% might be too low.". The kernel default is 50%, and we have been using
that seemingly without trouble since https://fedoraproject.org/wiki/Features/tmp-on-tmpfs.
So let's just make it 50% again.)
See 7d85383edb.
(Another consideration is that we learned from from the whole initiative with
zram in Fedora that a reasonable size for zram is 0.5-1.5 of RAM, and that pretty
much all systems benefit from having zram or zswap enabled. Thus it is reasonable
to assume that it'll become widely used. Taking the usual compression effectiveness
of 0.2 into account, machines have effective memory available of between
1.0 - 0.2*0.5 + 0.5 = 1.4 (for zram sized to 0.5 of RAM) and
1.0 - 0.2*1.5 + 1.5 = 2.2 (for zram 1.5 sized to 1.5 of RAM) times RAM size.
This means that the 10% was really like 7-4% of effective memory.)
A comment is added to mount-util.h to clarify that tmp.mount is separate.
Previously, on DHCPv4 address renewal, the old address may be removed
while the new address is not ready yet.
This also simplifies the logic of removing address and routes.