Behaviour is not identical, as shown by the tests in test-strv.
The combination of EXTRACT_UNQUOTE without EXTRACT_RELAX only appears in
the test, so it doesn't seem particularly important. OTOH, the difference
in handling of squished parameters could make a difference. New behaviour
is what both bash and python do, so I think we can ignore this corner case.
This change has the following advantages:
- the duplication of code paths that do a very similar thing is removed
- extract_one_word() / strv_split_extract() return a proper error code.
The commit 10ce2e0681 inverts the order of
SO_{RCV,SND}BUFFORCE and SO_{RCV,SND}BUF. However, setting buffer size with
SO_{RCV,SND}BUF does not fail even if the requested size is larger than
the kernel limit. Hence, SO_{RCV,SND}BUFFORCE will not use anymore and
the buffer size is always limited by the kernel limit even if we have
the priviledge to ignore the limit.
This makes the buffer size is checked after configuring it with
SO_{RCV,SND}BUF, and if it is still not sufficient, then try to set it
with FORCE command. With this commit, if we have enough priviledge, the
requested buffer size is correctly set.
Hopefully fixes#14417.
On systems that boot without initrd on a btrfs root file systems the
BTRFS_IOC_DEV_INFO ioctl returns /dev/root as backing device. That
sucks, since that is not a real device visible to userspace.
Since this has been that way since forever, and it doesn't look like the
kernel will get fixed soon for this, let's at least generate a useful
error message in this case.
This is not a bug fix, just a tweak to make this more recognizable.
Once the kernel gets fixed to report the correct device nodes in this
case, in a way userspace can make sense of them things will magically
work for systemd, too.
(Note that this doesn't add a log message about this to really all cases
we call get_device() in, but just the main ones that are called in early
boot context, after all all there's no benefit in seeing this message
too many times.)
https://github.com/systemd/systemd/issues/16953https://bugs.freedesktop.org/show_bug.cgi?id=84689https://bugzilla.kernel.org/show_bug.cgi?id=89721
log_debug still returns 0. I think it is legitimate to use 'return log_debug()' to
return 0. It is different than the other functions, since we often want to supress
errors logged at debug level. This case is quite common in the codebase and
we could use 'return log_debug_errno()' to make the code more consise.
For all other variants, a separate return line is required.
Previous commit changes all the non-conforming instances, now we can make it mandatory.
Binaries might not initialize SELinux, e.g. when they normally do not
create files with the SELinux default context.
If they, via an internal libary function, call a _label() function,
mac_selinux_maybe_reload() gets called. Since the SELinux status page
has not been opened, selinux_status_updated() will fail with EINVAL.
This affects particularly test binaries.
Just exit early and avoid confusing debug logs.
I had to move STRV_MAKE to macro.h. There is a circular dependency between
extract-word.h, strv.h, and string-util.h that makes it hard to define the
inline function otherwise.
This simplifies things quite a bit, and is reusable wherever we want to
use statx() later on. Not sure why I didn't do it like this right from
the beginning...
I think this is nicer in general, and here in particular we have a lot
of code like:
static inline IteratedCache* hashmap_iterated_cache_new(Hashmap *h) {
return (IteratedCache*) _hashmap_iterated_cache_new(HASHMAP_BASE(h));
}
and it's visually appealing to use the same whitespace in the function
signature and the cast in the body of the function.
The compiler would do this to, esp. with LTO, but we can short-circuit the
whole process and make everything a bit simpler by avoiding the separate
definition.
(It would be nice to do the same for _set_new(), _set_ensure_allocated()
and other similar functions which are one-line trivial wrappers too. Unfortunately
that would require enum HashmapType to be made public, which we don't want
to do.)
Up to now the capability CAP_SETPCAP was raised implicitly in the
function capability_bounding_set_drop.
This functionality is moved into a new function
(capability_gain_cap_setpcap).
The new function optionally provides the capability set as it was
before raisining CAP_SETPCAP.
I think it's nicer to move it to the left, since the function
is already a pointer by itself, and it just happens to return a pointer,
and the two concepts are completely separate.
Instead of assuming that more-recently modified directories have higher mtime,
just look for any mtime changes, up or down. Since we don't want to remember
individual mtimes, hash them to obtain a single value.
This should help us behave properly in the case when the time jumps backwards
during boot: various files might have mtimes that in the future, but we won't
care. This fixes the following scenario:
We have /etc/systemd/system with T1. T1 is initially far in the past.
We have /run/systemd/generator with time T2.
The time is adjusted backwards, so T2 will be always in the future for a while.
Now the user writes new files to /etc/systemd/system, and T1 is updated to T1'.
Nevertheless, T1 < T1' << T2.
We would consider our cache to be up-to-date, falsely.
This allows us to properly detect mount points, for free. (Also, allows
us to respect btimes that are newer than the cutoff, which should be
useful when people untar file trees in /var/tmp)
Fixes: #16848
There is little point in #defining and #undefining CAP_LAST_CAP multiple times.
The check is only done in developer mode. After all, it's not an error to
compile on a newer kernel, and we shouldn't even warn in that case.
Switch from security_getenforce() and netlink notifications to the
SELinux status page.
This usage saves system calls and will also be the default in
libselinux > 3.1 [1].
[1]: 05bdc03130
Previously:
1. last_error wouldn't be updated with errors from is_dir;
2. We'd always issue a stat(), even for binaries without execute;
3. We used stat() instead of access(), which is cheaper.
This change avoids all of those, by only checking inside X_OK-positive
case whether access() works on the path with an extra slash appended.
Thanks to Lennart for the suggestion.
It's pretty useful to be able to access the bytes generically, without
acknowledging a specific family, hence let's a third way to access an
in_addr_union.
Imagine $PATH /a:/b. There is an echo command at /b/echo. Under this
configuration, this works fine:
% systemd-run --user --scope echo .
Running scope as unit: run-rfe98e0574b424d63a641644af511ff30.scope
.
However, if I do `mkdir /a/echo`, this happens:
% systemd-run --user --scope echo .
Running scope as unit: run-rcbe9369537ed47f282ee12ce9f692046.scope
Failed to execute: Permission denied
We check whether the resulting file is executable for the performing
user, but of course, most directories are anyway, since that's needed to
list within it. As such, another is_dir() check is needed prior to
considering the search result final.
Another approach might be to check S_ISREG, but there may be more gnarly
edge cases there than just eliminating this obviously pathological
example, so let's just do this for now.
When removing a directory tree as unprivileged user we might encounter
files owned by us but not deletable since the containing directory might
have the "r" bit missing in its access mode. Let's try to deal with
this: optionally if we get EACCES try to set the bit and see if it works
then.
Instead of defining the numbers only as fallback, always define our fallback
number, and if we have the real __NR_foo define, assert that our number matches
the real one.
This will result in warnings when our fallback number is not defined, even if
the kernel headers are new enough to define __NR_foo. This will probably annoy
people compiling for seldom-used architectures, but hopefully it'll provide
motivation to add the missing fallback defines.
The upside is that we have a higher chance of catching the cases where we got
the number wrong. Calling the wrong syscall is quite problematic, and with some
back luck, it might take us a long time to notice that we got the number wrong
on some rarely used architecture.
Also, rework some of the fallback wrappers to not call the syscall with a
negative number (that'd fail, but we'd got to the kernel and back). It seems
nicer to let the compiler know that this can never succeed.
Similarly to "setup" vs. "set up", "fallback" is a noun, and "fall back"
is the verb. (This is pretty clear when we construct a sentence in the
present continous: "we are falling back" not "we are fallbacking").
This has the major benefit that the entire payload of the container can
access these files there. Previously, we'd set them only as env vars,
but that meant only PID 1 could read them directly or other privileged
payload code with access to /run/1/environ.
The kernel finally has a proper API to determine the mnt_id of a file.
Let's use it.
This adds support for the STATX_MNT_ID field of statx(), added in
kernel 5.8.
This was needed before commit 16e4fd87c5 added a
mode that opens the log fds for every single log message. This mode is used in
execute.c since then making the explicit call to log_open unnecessary.
This basically reverts ea89a119cd.
LOOP_CONFIGURE allows us to configure a loopback device in one ioctl
instead of two, which is not just faster but also removes the race that
udev might start probing the device before we adjusted things properly.
Unfortunately LOOP_CONFIGURE is broken in regards to LO_FLAGS_PARTSCAN
as of kernel 5.8.0. This patch contains a work-around for that, to
fallback to old behaviour if partition scanning is requested but does
not work. Sucks a bit.
Proposed upstream fix for that issue:
https://lkml.org/lkml/2020/8/6/97
Given a string in the format 'one:two three four:five', returns a string
vector with each word. If the second element of the tuple is not
present, an empty string is returned in its place, so that the vector
can be processed in pairs.
[zjs: use EXTRACT_UNESCAPE_SEPARATORS instead of EXTRACT_CUNESCAPE_RELAX.
This way we do escaping exactly once and in normal strict mode.]
We would add and remove definitions based on which colors were used by other
code. Let's just define all of them to simplify tests and allow easy comparisons
which colors look good.
Let's split this out into its own helper function we can reuse at
various places.
Also, let's avoid signed values where we can so that we can cover more
of the available time range.
We never return anything higher than 63, so using "long unsigned"
as the type only confused the reader. (We can still use "long unsigned"
and safe_atolu() to parse the kernel file.)
We would refuse to print capabilities which were didn't have a name
for. The kernel adds new capabilities from time to time, most recently
cap_bpf. 'systmectl show -p CapabilityBoundingSet ...' would fail with
"Failed to parse bus message: Invalid argument" because
capability_set_to_string_alloc() would fail with -EINVAL. So let's
print such capabilities in hexadecimal:
CapabilityBoundingSet=cap_chown cap_dac_override cap_dac_read_search
cap_fowner cap_fsetid cap_kill cap_setgid cap_setuid cap_setpcap
cap_linux_immutable cap_net_bind_service cap_net_broadcast cap_net_admin
cap_net_raw cap_ipc_lock cap_ipc_owner 0x10 0x11 0x12 0x13 0x14 0x15 0x16
0x17 0x18 0x19 0x1a ...
For symmetry, also allow capabilities that we don't know to be specified.
Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1853736.
https://tools.ietf.org/html/draft-knodel-terminology-02https://lwn.net/Articles/823224/
This gets rid of most but not occasions of these loaded terms:
1. scsi_id and friends are something that is supposed to be removed from
our tree (see #7594)
2. The test suite defines an API used by the ubuntu CI. We can remove
this too later, but this needs to be done in sync with the ubuntu CI.
3. In some cases the terms are part of APIs we call or where we expose
concepts the kernel names the way it names them. (In particular all
remaining uses of the word "slave" in our codebase are like this,
it's used by the POSIX PTY layer, by the network subsystem, the mount
API and the block device subsystem). Getting rid of the term in these
contexts would mean doing some major fixes of the kernel ABI first.
Regarding the replacements: when whitelist/blacklist is used as noun we
replace with with allow list/deny list, and when used as verb with
allow-list/deny-list.
Presently, CLI utilities such as systemctl will check whether they have a tty
attached or not to decide whether to parse /proc/cmdline or EFI variable
SystemdOptions looking for systemd.log_* entries.
But this check will be misleading if these tools are being launched by a
daemon, such as a monitoring daemon or automation service that runs in
background.
Make log handling of CLI tools uniform by never checking /proc/cmdline or EFI
variables to determine the logging level.
Furthermore, introduce a new log_setup_cli() shortcut to set up common options
used by most command-line utilities.
Also use double space before the tracking args at the end. Without
the comma this looks ugly, but it's a bit better with the double space.
At least it doesn't look like a variable with a type.
This combines set_ensure_allocated() with set_consume(). The cool thing is that
because we know the hash ops, we can correctly free the item if appropriate.
Similarly to set_consume(), the goal is to simplify handling of the case where
the item needs to be freed on error and if already present in the set.
* Drop mac_selinux_use() condition from mac_selinux_free(): if the
passed pointer holds memory we want to free it even if SELinux is
disabled
* Drop NULL-check cause man:freecon(3) states that freecon(NULL) is a
well-defined NOP
* Assert that on non-SELinux builds the passed pointer is always NULL,
to avoid memory leaks
This just adds a _cleanup_ helper call encapsulating dlclose().
This also means libsystemd-shared is linked against libdl now. I don't
think this is much of an issue, since libdl is part of glibc anyway, and
anything from exotic. It's not an optional part of the OS (think: NSS
requires dynamic linking), hence this pulls in no deps and is almost
certainly loaded into all process' memory anyway.
[zj: use DEFINE_TRIVIAL_CLEANUP_FUNC().]
It's such a common operation to allocate the set and put an item in it,
that it deserves a helper. set_ensure_put() has the same return values
as set_put().
Comes with tests!
../src/core/main.c: In function 'main':
../src/core/main.c:2637:32: error: implicit declaration of function 'cache_efi_options_variable'; did you mean 'systemd_efi_options_variable'? [-Werror=implicit-function-declaration]
(void) cache_efi_options_variable();
^~~~~~~~~~~~~~~~~~~~~~~~~~
systemd_efi_options_variable
The original logic was logging an "ignored" debug message, but it was still
going ahead and calling proc_cmdline_parse_given() on the NULL line. Fix that
to skip that explicitly when the EFI variable wasn't really read.
Cache it early in startup of the system manager, right after `/run/systemd` is
created, so that further access to it can be done without accessing the EFI
filesystem at all.
Prompted by the discussion on #16110, let's migrate more code to
fd_wait_for_event().
This only leaves 7 places where we call into poll()/poll() directly in
our entire codebase. (one of which is fd_wait_for_event() itself)
"less" doesn't properly reset its terminal on SIGTERM, it does so only
on SIGINT. Let's thus configure SIGINT instead of SIGTERM.
I think this is something less should fix too, and clean up things
correctly on SIGTERM, too. However, given that we explicitly enable
SIGINT behaviour by passing "K" to $LESS I figure it makes sense if we
also send SIGINT instead of SIGTERM to match it.
Fixes: #16084
poll() sets POLLNVAL inside of the poll structures if an invalid fd is
passed. So far we generally didn't check for that, thus not taking
notice of the error. Given that this specific kind of error is generally
indication of a programming error, and given that our code is embedded
into our projects via NSS or because people link against our library,
let's explicitly check for this and convert it to EBADF.
(I ran into a busy loop because of this missing check when some of my
test code accidentally closed an fd it shouldn't close, so this is a
real thing)
The usual behaviour when a timeout expires is to terminate/kill the
service. This is what user usually want in production systems. To debug
services that fail to start/stop (especially sporadic failures) it
might be necessary to trigger the watchdog machinery and write core
dumps, though. Likewise, it is usually just a waste of time to
gracefully stop a stuck service. Instead it might save time to go
directly into kill mode.
This commit adds two new options to services: TimeoutStartFailureMode=
and TimeoutStopFailureMode=. Both take the same values and tweak the
behavior of systemd when a start/stop timeout expires:
* 'terminate': is the default behaviour as it has always been,
* 'abort': triggers the watchdog machinery and will send SIGABRT
(unless WatchdogSignal was changed) and
* 'kill' will directly send SIGKILL.
To handle the stop failure mode in stop-post state too a new
final-watchdog state needs to be introduced.
Let's allow "-0" as alternative to "+0" and "0" when parsing integers,
unless the new SAFE_ATO_REFUSE_PLUS_MINUS flag is specified.
In cases where allowing the +/- syntax shall not be allowed
SAFE_ATO_REFUSE_PLUS_MINUS is the right flag to use, but this also means
that -0 as only negative integer that fits into an unsigned value should
be acceptable if the flag is not specified.
Quoting https://github.com/systemd/systemd/issues/14828#issuecomment-635212615:
> [kernel uses] msleep_interruptible() and that means when the process receives
> any kind of signal masked or not this will abort with EINTR. systemd-logind
> gets signals from the TTY layer all the time though.
> Here's what might be happening: while logind reads the EFI stuff it gets a
> series of signals from the TTY layer, which causes the read() to be aborted
> with EINTR, which means logind will wait 50ms and retry. Which will be
> aborted again, and so on, until quite some time passed. If we'd not wait for
> the 50ms otoh we wouldn't wait so long, as then on each signal we'd
> immediately retry again.
We would parse numbers with base prefixes as user identifiers. For example,
"0x2b3bfa0" would be interpreted as UID==45334432 and "01750" would be
interpreted as UID==1000. This parsing was used also in cases where either a
user/group name or number may be specified. This means that names like
0x2b3bfa0 would be ambiguous: they are a valid user name according to our
documented relaxed rules, but they would also be parsed as numeric uids.
This behaviour is definitely not expected by users, since tools generally only
accept decimal numbers (e.g. id, getent passwd), while other tools only accept
user names and thus will interpret such strings as user names without even
attempting to convert them to numbers (su, ssh). So let's follow suit and only
accept numbers in decimal notation. Effectively this means that we will reject
such strings as a username/uid/groupname/gid where strict mode is used, and try
to look up a user/group with such a name in relaxed mode.
Since the function changed is fairly low-level and fairly widely used, this
affects multiple tools: loginctl show-user/enable-linger/disable-linger foo',
the third argument in sysusers.d, fourth and fifth arguments in tmpfiles.d,
etc.
Fixes#15985.
"internal" is a lot of characters. Let's take a leaf out of the Python's book
and simply use _ to mean private. Much less verbose, but the meaning is just as
clear, or even more.
This is a safey net anyway, let's make it fully safe: if the data ends
on an uneven byte, then we need to complete the UTF-16 codepoint first,
before adding the final NUL byte pair. Hence let's suffix with three
NULs, instead of just two.
Tracking down #15931 confused the hell out of me, since running homed in
gdb from the command line worked fine, but doing so as a service failed.
Let's make this more debuggable and check if we live in the host netns
when allocating a new udev monitor.
This is just debug stuff, so that if things don't work, a quick debug
run will reveal what is going on.
That said, while we are at it, also fix unexpected closing of passed in
fd when failing.
Possibly fixes#15220. (There might be another leak. I'm still investigating.)
The leak would occur when the path cache was rebuilt. So in normal circumstances
it wouldn't be too bad, since usually the path cache is not rebuilt too often. But
the case in #15220, where new unit files are created in a loop and started, the leak
occurs once for each unit file:
$ for i in {1..300}; do cp ~/.config/systemd/user/test0001.service ~/.config/systemd/user/test$(printf %04d $i).service; systemctl --user start test$(printf %04d $i).service;done
Let's be more thorough that whenever we build a unit name based on
parameters, that the result is actually a valid user name. If it isn't
fail early.
This should allows us to catch various issues earlier, in particular
when we synthesize mount units from /proc/self/mountinfo: instead of
actually attempting to allocate a mount unit we will fail much earlier
when we build the name to synthesize the unit under. Failing early is a
good thing generally.
And do not use it in the IMPORT{cmdline} udev code. Wherever we expose
direct interfaces to check the kernel cmdline, let's not consult our
systemd-specific EFI variable, but strictly use the actual kernel
variable, because that's what we claim we do. i.e. it's fine to use the
EFI variable for our own settings, but for the generic APIs to the
kernel cmdline we should not use it.
Specifically, this applies to IMPORT{cmdline} and
ConditionKernelCommandLine=. In the latter case we weren#t checking the
EFI variable anyway, hence let's do the same for the udev case, too.
Fixes: #15739
It annoyed me for quite a while that running "journalctl --file=…" on a
file that is not readable failed with a "File not found" error instead
of a permission error. Let's fix that.
We make this work by using the GLOB_NOCHECK flag for glob() which means
that files are not accessible will be returned in the array as they are
instead of being filtered away. This then means that our later attemps
to open the files will fail cleanly with a good error message.
kernel 5.6 added support for a new flag for getrandom(): GRND_INSECURE.
If we set it we can get some random data out of the kernel random pool,
even if it is not yet initializated. This is great for us to initialize
hash table seeds and such, where it is OK if they are crap initially. We
used RDRAND for these cases so far, but RDRAND is only available on
newer CPUs and some archs. Let's now use GRND_INSECURE for these cases
as well, which means we won't needlessly delay boot anymore even on
archs/CPUs that do not have RDRAND.
Of course we never set this flag when generating crypto keys or uuids.
Which makes it different from RDRAND for us (and is the reason I think
we should keep explicit RDRAND support in): RDRAND we don't trust enough
for crypto keys. But we do trust it enough for UUIDs.
As described in #15603, it is a fairly common setup to use a fqdn as the
configured hostname. But it is often convenient to use just the actual
hostname, i.e. until the first dot. This adds support in tmpfiles, sysusers,
and unit files for %l which expands to that.
Fixes#15603.
This new helper checks whether the specified locale is installed. It's
distinct from locale_is_valid() which just superficially checks if a
string looks like something that could be a valid locale.
Heavily inspired by @jsynacek's #13964.
Replaces: #13964
We always need to make them unions with a "struct cmsghdr" in them, so
that things properly aligned. Otherwise we might end up at an unaligned
address and the counting goes all wrong, possibly making the kernel
refuse our buffers.
Also, let's make sure we initialize the control buffers to zero when
sending, but leave them uninitialized when reading.
Both the alignment and the initialization thing is mentioned in the
cmsg(3) man page.