This fixes a race where a block device that pops up and immediately is
locked (such as a loopback device in preparation) might result in
udev never run any rules for it, and thus never turn on inotify watching
for it (as inotify watching is controlled via an option set via udev
rules), thus not noticing when the device is unlocked/closed again
(which is noticed via IN_CLOSE_WRITE inotify events).
This changes two things:
1. Whenever we encounter a locked block device we'll now inotify watch
it, so that it is guaranteed we'll notice when the BSD lock fd is
closed again, and will reprobe.
2. We'll now turn off inotify watching again once we realise the
udev rules don't actually want that. Previously, once watching a
device was enabled via a udev rule, it would be watched forever until
the device disappeared, even if the option was dropped by the rules
for later events.
Together this will make sure that we'll watch the device via inotify
in both of the following cases:
a) The block device has been BSD locked when udev wanted to look at it
b) The udev rules run for the last seen event for the device say so
In all other cases inotify is off for block devices.
This new behaviour both fixes the race, but also makes the most sense,
as the rules (when they are run) actually really control the watch state
now. And if someone BSD locks a block device then it should be OK to
inotify watch it briefly until the lock is released again as the user
this way more or less opts into the locking protocol.
It makes sense to ignore all the common fields that are expected and
that we can safely ignore. Note that it is fine to ignore URL as we will
already warn about the type= being wrong in that case.
Closes: #17276
This helper alone doesn't make too much sense, but it's preparatory work
for #17274, and I guess it can't hurt to land it early, it does make the
ratelimit code a tiny bit prettier after all.
While a server is in the VARLINK_PENDING_METHOD or VARLINK_PENDING_METHOD_MORE
states and its write end is disconnected and it gets a POLLHUP, we
should disconnect since it can't write anymore.
In the case of systemd-oomd disconnecting while pid1 was pending-more, this
condition left pid1 in a state where it started throttling from
continually getting POLLHUP.
This is just some refactoring: shifting around of code, not change in
codeflow.
This splits up the way too huge systemctl.c in multiple more easily
digestable files. It roughly follows the rule that each family of verbs
gets its own .c/.h file pair, and so do all the compat executable names
we support. Plus three extra files for sysv compat (which existed before
already, but I renamed slightly, to get the systemctl- prefix lik
everything else), a -util file with generic stuff everything uses, and a
-logind file with everything that talks directly to logind instead of
PID1.
systemctl is still a bit too complex for my taste, but I think this way
itc omes in a more digestable bits at least.
No change of behaviour, just reshuffling of some code.
It's an auxiliary function to the UnitInfo structures, and very generic.
Let's hence move it over to the other code operating with UnitInfo, even
if it's not used by code outside of systemctl (yet).
We have an option to set the fallback list, so we don't know what the contents
are. It may in fact be empty. Let's add some examples to make it easy for a user
stranded without any DNS to fill in something that would work. As a bonus, this
also gives names to the entries we provide by default.
(I added google and cloudflare because that's what we have currently, and quad9
because it seems to be a good privacy-concious and fast choice and was requested
in #12499. As a minimum, things we should include should be well-known global
services with a documented privacy policy and both IPv4 and IPv6 support and
decent response times.)
Also, document this functionality more prominently, including with a
reference from sd_event_exit().
This is mostly to make things complete, as previously we supported NULL
callbacks only in _add_time() and _add_signal(). However, I think this
makes snese for IO event sources too (think: when some fd such as a pipe
end sees SIGHUP or so, exit), as well as defer or post event sources (i.e. exit
once we got nothing else to do). This also adds support for inotify
event sources, simply to complete things (I can't see the immediate use,
but maybe someone else comes up with it).
The only event source type that doesn't allow callback=NULL now are exit
callbacks, but for them they make little sense, as the event loop is
exiting then anyway.
Currently, if an event source callback returns an error, we'll disable
the event source and continue. This adds a per-event source flag that if
turned on goes further: the event loop is also exited, propagating the
error code.
This is inspired by some patterns repeatedly seen in #15206.
The idea is that event sources that server the "primary" function of a
program are marked like this, so that if they fail the failure is
instantly propagated and terminates the program.
Some extra safety when invoked via "sudo". With this we address a
genuine design flaw of sudo, and we shouldn't need to deal with this.
But it's still a good idea to disable this surface given how exotic it
is.
Prompted by #5666
Functions that accept no arguments should be
explicitly declared a void parameter in their parameter list.
Signed-off-by: Marco Wang <m.aesophor@gmail.com>
Previously, address_establish() took Address object stored in Network
object. And address_release() took Address object stored in Link
object. Thus, address_release() always did nothing.
When the MAC address of a link is updated, an address on the link may
be under checking address duplication. Or, (currently such code is not
implemented yet, but) address duplication check may be restarted later.
For that case, the IPv4 ACD clients must use the new updated MAC address.
Previously, IPv4 DAD is configured in each Address object stored in
Network object. If a .network file matches multipe links, then it causes
an assertion. To prevent it, now IPv4 DAD is configured in each Address
object belogs to Link object.
Currently systemd-detect-virt fails to detect running under PowerVM.
Add code to detect PowerVM based on code in util-linux.
Signed-off-by: Michal Suchanek <msuchanek@suse.de>
We should chown what we allocate ourselves, i.e. any pty we allocate
ourselves. But for stuff we propagate, let's avoid that: we shouldn't
make more changes than necessary.
Fixes: #17229
If we set O_NONBLOCK on stdin/stdout directly this means the flag is
left set when we abort abnormally, as we don't get the chance to reset
it again on exit. This might confuse progrms invoking us. Moreover, if
programs invoking us continue to write to the stdout passed to us, they
might be confused by non-blocking mode too.
Hence, let's avoid this if possible: let's reopen stdin/stdout and set
O_NONBLOCK only on the reopend fds, leaving the original fds as they
are.
Prompted-by: https://github.com/systemd/systemd/pull/17070#issuecomment-702304802
*** Running /home/zbyszek/src/systemd-work/test/test-sysusers/test-14.input (with login.defs symlinked)
login.defs specifies UID allocation range 401–555 that is different than the built-in defaults (201–998)
login.defs specifies GID allocation range 405–666 that is different than the built-in defaults (201–990)
Also, even if login.defs are not present, don't start allocating at 1, but at
SYSTEM_UID_MIN.
Fixes#9769.
The test is adjusted. Actually, it was busted before, because sysusers would
never use SYSTEM_GID_MIN, so if SYSTEM_GID_MIN was different than
SYSTEM_UID_MIN, the tests would fail. On all "normal" systems the two are
equal, so we didn't notice. Since sysusers now always uses the minimum of the
two, we only need to substitute one value.
We don't (and shouldn't I think) look at them when determining the type of the
user, but they should be used during user/group allocation. (For example, an
admin may specify SYS_UID_MIN==200 to allow statically numbered users that are
shared with other systems in the range 1–199.)
It makes little sense to make the boundary between systemd and user guids
configurable. Nevertheless, a completely fixed compile-time define is not
enough in two scenarios:
- the systemd_uid_max boundary has moved over time. The default used to be
500 for a long time. Systems which are upgraded over time might have users
in the wrong range, but changing existing systems is complicated and
expensive (offline disks, backups, remote systems, read-only media, etc.)
- systems are used in a heterogenous enviornment, where some vendors pick
one value and others another.
So let's make this boundary overridable using /etc/login.defs.
Fixes#3855, #10184.
THis doesn't change the condition's logic at all, but is an attempt to
make things a bit more readable: instead of checking log_target !=
LOG_TARGET_AUTO let's actually list the targets where we want to
consider journal/syslog/kmsg, to make things a bit less confusing. After
all the message here is not to avoid them if LOG_TARGET_AUTO is set, but
to definitely do them in the other cases.
Let's explicitly deactivate all home dirs on shutdown, in order to
properly synchronizing unmounting and avoiding blocking devices.
Previously, we'd rely on automatic deactivation when home directories
become unused. However, that scheme is asynchronous, and ongoing
deactviations might conflicts with attempts to unmount /home. Let's fix
that by providing an explicit service systemd-homed-activate.service
whose only job is to have a ExecStop= line that explicitly deactivates
all home directories on shutdown. This service can the be ordered after
home.mount and similar, ensuring that we'll first deactivate all homes
before deactivating /home itself during shutdown.
This is kept separate from systemd-homed.service so that it is possible
to restart systemd-homed.service without deactivating all home
directories.
Fixes: #16842
If the hostname of a system is set to an fqdn, glibc traditionally
derives a search domain from it if none is explicitly configured.
This is a bit weird, and we currently don't do that in our own search
path logic.
Following #17193 let's turn this behaviour off for now.
Yes, this has a slight chance of pissing people off who think this
behaviour is good. If this is indeed an issue, we can revisit the issue
but in that case if we readd the concept we should do it properly:
derive the search domain from the fqdn in our codebase too and report it
in resolvectl, and in our generated stub files. But I have the suspicion
most people who set the hostname to an fqdn aren#t even aware of this
behaviour nor want it, so let's wait until people complain.
Fixes: #17193
It can be one of "foreign", "missing", "stub", "static", "uplink",
depending on how /etc/resolv.conf is set up:
foreign → someone/something else manages /etc/resolv.conf,
systemd-resolved is just the consumer
missing → /etc/resolv.conf is missing altogether
stub/static/uplink → the file is managed by resolved, with the
well-known modes
Fixes: #17159
This basically implements fc58c0c7bf for gshadow.
gpasswd may not have a lock/unlock that behaves the same as passwd, but
according to gshadow(5) the logic of the password field is the same.
This is like membarrier() I guess and basically just exposes CPU
functionality via kernel syscall on some archs. Let's whitelist it for
everyone.
Fixes: #17197
This effectively reverts commit 67acde4869.
After commits 569ad251ad and
67acde4869, -EACCES errors are ignored,
and thus 'udevadm trigger' succeeds even when it is invoked by non-root
users. Moreover, on -EACCES error, log messages are shown in debug
level, so usually we see no message, and users are easily confused
why uevents for devices are not triggered.
It always was the intention to expose this as trusted field _TID=, i.e.
automatically determine it from journald via some SCM_xyz field or so,
but this is never happened, and it's unlikely this will be added anytime
soon to the kernel either, hence let's just generate this sender side,
even if it means it's untrusted.
Let's turn off the search domain logic if a trailing dot is specified
when looking up hostnames and RRs via the Varlink + D-Bus APIs (and thus
also when doing so via nss-resolve). (This doesn't affect lookups via
the stub, since for the any search path logic is done client side
anyway)
It might make sense to force the DNS protocol in this case too (and
disable LLMR + mDNS), but we'll leave that for a different PR — if it
even makes sense. It might also make sense to disable the logic of never
routing single-label lookups to the Internet if a trailing to is
specified, but this needs more discussion too.
Strivtly speaking, this breaks backward compatibility. But setting
too large value into them, then their networking easily breaks.
Note that typically 100 for them is event too large. So, ommiting the
values equal or higher than 1024, and dropping support of k, M, and G
suffixes is OK for normal appropriate use cases.
See discussion in #16643.
Let's open the device node to modify with O_PATH, and then adjust it
only after verifying everything is in order. This fixes a race where the
a device appears, disappears and quickly reappers, while we are still
running the rules for the first appearance: when going by path we'd
possibly adjust half of the old and half of the new node. By O_PATH we
can pin the node while we operate on it, thus removing the race.
Previously, we'd do a superficial racey check if the device node changed
undearneath us, and would propagate EEXIST in that case, failing the
rule set. With this change we'll instead gracefully handle this, exactly
like in the pre-existing case when the device node disappeared in the
meantime.
Fixes#16991fb39af4ce4 replaced `free_arguments()` with
`reset_arguments()`, which frees arg_* variables as before, but also resets all
of them to the default values. `reset_arguments()` was positioned
in such a way that it overrode some arg_* values still in use at shutdown.
To avoid further unintentional resets, I moved `reset_arguments()`
right before the return, when nothing else will be using the arg_* variables.
If a namespace with PrivateTmp=true is constructed we need to restore
the context of the namespaces /tmp directory (i.e.
/tmp/systemd-private-XXXXX/tmp) to the (default) context of /tmp .
Otherwise filetransitions might result in the namespaces tmp directory
having the wrong context.
After https://github.com/systemd/systemd/pull/16981 only the presence of crypt_gensalt_ra
is checked, but there are cases where that function is available but crypt_preferred_method
is not, and they are used in the same ifdef.
Add a check for the latter as well.
Adds support for LUKS detached header device on kernel
command line. It's introduced via extension to existing
luks.options 'header=' argument beyond colon (see examples
below). If LUKS header device is specified it's expected
to contain filesystem with LUKS header image on a path
specified in the first part of header specification.
The second parameter 'luks.data' specifies LUKS data device
supposed to be paired with detached LUKS header (note that
encrypted LUKS data device with detached header is unrecognisable
by standard blkid probe).
This adds support for LUKS encrypted rootfs partition with
detached header. It can also be used for initializing online LUKS2
encryption of data device.
Examples:
luks.data=<luks_uuid>=/dev/sdz
luks.data=<luks_uuid>=/dev/vg/lv
luks.data=<luks_uuid>=/dev/mapper/lv
luks.data=<luks_uuid>=PARTUUID=<part_uuid>
luks.data=<luks_uuid>=PARTLABEL=<part_uuid>
luks.options=<luks_uuid>=header=/header/path:UUID=<fs_uuid>
luks.options=<luks_uuid>=header=/header/path:PARTUUID=<part_uuid>
luks.options=<luks_uuid>=header=/header/path:PARTLABEL=<part_label>
luks.options=<luks_uuid>=header=/header/path:LABEL=<fs_label>
luks.options=<luks_uuid>=header=/header/path:/dev/sdx
luks.options=<luks_uuid>=header=/header/path:/dev/vg/lv
The '/header/path' is considered to be relative location within
filesystem residing on the header device specified beyond colon
character
Fixes#17035. We use "," as the separator between arguments in fstab and crypttab
options field, but the kernel started using "," within arguments. Users will need
to escape those nested commas.
`startswith` already returns the string with the prefix skipped, so we
can simplify this further and avoid using a magic value.
Noticed in passing.
Co-authored-by: Lennart Poettering <lennart@poettering.net>
SD_DHCP6_OPTION_IA_NA does not exist in DHCP6_ADVERTISE packet if DHCP server only provides prefix delegation. So the attempt to send the DHCP6_REQUEST packet fails on r = dhcp6_option_append_ia(&opt, &optlen, &client->lease->ia); forever.
When invoked as non-root, we would suggest re-running as root without any
further hint. But this immediately spawns a machine from the local directory,
which can be rather surprising. So let's give a better hint.
(In general, I don't think commandline programs should do "significant" things
when invoked without any arguments. In this regard it would be better if
systemd-nspawn would not spawn a machine from the current directory if called
with no arguments and at least "-D ." would be required.)
If the whole call is simple and we don't need to look at the return value
apart from the conditional, let's use a form without assignment of the return
value. When the function call is more complicated, it still makes sense to
use a temporary variable.
Lennart wanted to do this back in
01c33c1eff.
For better or worse, this wasn't done because I thought that turning on MountAPIVFS
is a compat break for RootDirectory and people might be negatively surprised by it.
Without this, search for binaries doesn't work (access_fd() requires /proc).
Let's turn it on, but still allow overriding to "no".
When RootDirectory=/, MountAPIVFS=1 doesn't work. This might be a buglet on its
own, but this patch doesn't change the situation.
Right now, we always say `/etc/crypttab` even if the source was fully
derived from the kargs.
Let's match what `systemd-fstab-generator` does and use `/proc/cmdline`
when that's the case.
(Well, at least the ones where that makes sense. Where it does't make
sense are the ones that re invoked on the root path, which cannot
possibly be a symlink.)
Let's make umount_verbose() more like mount_verbose_xyz(), i.e. take log
level and flags param. In particular the latter matters, since we
typically don't actually want to follow symlinks when unmounting.
This is a follow-up for cae1e8fb88c5a6b0960a2d0be3df8755f0c78462: we
also call the detach ioctls in the shutdown code, hence add the fsync()s
there too, just to be safe.
Let's make sure to use strna() on the strings returned by fd_get_path()
where we knowingly ignore any failures. We got this right in most cases,
but two were missing.
We use it pretty much everywhere else, hence use it here too.
This also changes the error generated from EOPNOTSUPP to ENOSYS, to
match the other cases where we do such a check. One user checked for
EOPNOTSUPP which is updated to check for ENOSYS instead.
Currently the systemd-shutdown command attempts to stop swaps, DM
(crypt, LVM2) and loop devices, but it doesn't attempt to stop MD
RAID devices, which means that if the RAID is set up on crypt,
loop, etc. device, it won't be able to stop those underlying devices.
This code extends the shutdown application to also attempt stopping
the MD RAID devices.
Signed-off-by: Hubert Kario <hubert@kario.pl>
The directory backend needs a file system path, and not a raw block
device. That's only supported for the LUKS2 backend.
Let's make this clearer in the man page and also generate a better error
message if attempted anyway.
Fixes: #17068
When exiting, let's explicitly wait for our worker processes to finish
first. That's useful if unmounting of /home/ is scheduled to happen
right after homed is down, as we then can be sure that the home
directories are properly unmounted and detached by the time homed is
fully terminated (otherwise it might happen that our worker gets killed
by the service manager, thus leaving the home directory and its backing
devices up/left for auto-clean which might be async).
Likely fixes#16842
When debugging homed while being logged into a user account manged by
homed it is a good idea to be able to run a second copy of homed. In
order to not collide with its AF_UNIX socket and bus name use, let's add
a new env var $SYSTEMD_HOME_DEBUG_SUFFIX, when set the busnames/socket
names are suffixed by it. When setting this while debugging one can
invoke an additional copy without interfering with the host one.
This has two advantages:
- we save a bit of IO in early boot because we don't look for executables
which we might never call
- if the executable is in a different place and it was specified as a
non-absolute path, it is OK if it moves to a different place. This should
solve the case paths are different in the initramfs.
Since the executable path is only available quite late, the call to
mac_selinux_get_child_mls_label() which uses the path needs to be moved down
too.
Fixes#16076.
Similar to free_and_replace. I think this should be uppercase to make it
clear that this is a macro. free_and_replace should probably be uppercased
too.
Other tools silently ignore non-executable names found in path. By checking
F_OK, we would could pick non-executable path even though there is an executable
one later.
We had a special test case that the second semicolon would be interpreted
as an executable name. We would then try to find the executable and rely
on ";" not being found to cause ENOEXEC to be returned. I think that's just
crazy. Let's treat the second semicolon as a separator and ignore the
whole thing as we would whitespace.
test-execute is quite long and even with the test name it takes a moment
to find the relevant spot when something fails. Let's make things easier
by printing the exact location.