There are downsides to using fexecve:
when fexecve is used (for normal executables), /proc/pid/status shows Name: 3,
which means that ps -C foobar doesn't work. pidof works, because it checks
/proc/self/cmdline. /proc/self/exe also shows the correct link, but requires
privileges to read. /proc/self/comm also shows "3".
I think this can be considered a kernel deficiency: when O_CLOEXEC is used, this
"3" is completely meaningless. It could be any number. The kernel should use
argv[0] instead, which at least has *some* meaning.
I think the approach with fexecve/execveat is instersting, so let's provide it
as opt-in.
For scripts, when we call fexecve(), on new kernels glibc calls execveat(),
which fails with ENOENT, and then we fall back to execve() which succeeds:
[pid 63039] execveat(3, "", ["/home/zbyszek/src/systemd/test/test-path-util/script.sh", "--version"], 0x7ffefa3633f0 /* 0 vars */, AT_EMPTY_PATH) = -1 ENOENT (No such file or directory)
[pid 63039] execve("/home/zbyszek/src/systemd/test/test-path-util/script.sh", ["/home/zbyszek/src/systemd/test/test-path-util/script.sh", "--version"], 0x7ffefa3633f0 /* 0 vars */) = 0
But on older kernels glibc (some versions?) implement a fallback which falls
into the same trap with bash $0:
[pid 13534] execve("/proc/self/fd/3", ["/home/test/systemd/test/test-path-util/script.sh", "--version"], 0x7fff84995870 /* 0 vars */) = 0
We don't want that, so let's call execveat() ourselves. Then we can do the
execve() fallback as we want.
FixedRandomDelay=yes will use
`siphash24(sd_id128_get_machine() || MANAGER_IS_SYSTEM(m) || getuid() || u->id)`,
where || is concatenation, instead of a random number to choose a value between
0 and RandomizedDelaySec= as the timer delay.
This essentially sets up a fixed, but seemingly random, offset for each timer
iteration rather than having a random offset recalculated each time it fires.
Closes#10355
Co-author: Anita Zhang <the.anitazha@gmail.com>
Quoting the manual page of stime(2): "Starting with glibc 2.31, this function
is no longer available to newly linked applications and is no longer declared
in <time.h>."
This beefs up the READ_FULL_FILE_CONNECT_SOCKET logic of
read_full_file_full() a bit: when used a sender socket name may be
specified. If specified as NULL behaviour is as before: the client
socket name is picked by the kernel. But if specified as non-NULL the
client can pick a socket name to use when connecting. This is useful to
communicate a minimal amount of metainformation from client to server,
outside of the transport payload.
Specifically, these beefs up the service credential logic to pass an
abstract AF_UNIX socket name as client socket name when connecting via
READ_FULL_FILE_CONNECT_SOCKET, that includes the requesting unit name
and the eventual credential name. This allows servers implementing the
trivial credential socket logic to distinguish clients: via a simple
getpeername() it can be determined which unit is requesting a
credential, and which credential specifically.
Example: with this patch in place, in a unit file "waldo.service" a
configuration line like the following:
LoadCredential=foo:/run/quux/creds.sock
will result in a connection to the AF_UNIX socket /run/quux/creds.sock,
originating from an abstract namespace AF_UNIX socket:
@$RANDOM/unit/waldo.service/foo
(The $RANDOM is replaced by some randomized string. This is included in
the socket name order to avoid namespace squatting issues: the abstract
socket namespace is open to unprivileged users after all, and care needs
to be taken not to use guessable names)
The services listening on the /run/quux/creds.sock socket may thus
easily retrieve the name of the unit the credential is requested for
plus the credential name, via a simpler getpeername(), discarding the
random preifx and the /unit/ string.
This logic uses "/" as separator between the fields, since both unit
names and credential names appear in the file system, and thus are
designed to use "/" as outer separators. Given that it's a good safe
choice to use as separators here, too avoid any conflicts.
This is a minimal patch only: the new logic is used only for the unit
file credential logic. For other places where we use
READ_FULL_FILE_CONNECT_SOCKET it is probably a good idea to use this
scheme too, but this should be done carefully in later patches, since
the socket names become API that way, and we should determine the right
amount of info to pass over.
* Existing valid rule files written with KEY="value" are not affected
* Now, KEY=e"value\n" becomes valid. Where `\n` is a newline character
* Escape sequences supported by src/basic/escape.h:cunescape() is
supported
We had two of each: both homectl and journalctl had the whole dlopen()
wrapper, and journalctl had two implementations (slightly different) of the
code to print the fss:// pattern.
print_qrcode() now returns -EOPNOTSUPP when compiled with qrcode support. Both
callers ignore the return value, so this changes nothing.
No functional change.
This adds a way to control SO_TIMESTAMP/SO_TIMESTAMPNS socket options
for sockets PID 1 binds to.
This is useful in journald so that we get proper timestamps even for
ingress log messages that are submitted before journald is running.
We recently turned on packet info metadata from PID 1 for these sockets,
but the timestamping info was still missing. Let's correct that.
Let's try to make collisions when multiple clients want to use the same
device less likely, by sleeping a random time on collision.
The loop device allocation protocol is inherently collision prone:
first, a program asks which is the next free loop device, then it tries
to acquire it, in a separate, unsynchronized setp. If many peers do this
all at the same time, they'll likely all collide when trying to
acquire the device, so that they need to ask for a free device again and
again.
Let's make this a little less prone to collisions, reducing the number
of failing attempts: whenever we notice a collision we'll now wait
short and randomized time, making it more likely another peer succeeds.
(This also adds a similar logic when retrying LOOP_SET_STATUS64, but
with a slightly altered calculation, since there we definitely want to
wait a bit, under all cases)
On systems that have a udev before
a7fdc6cbd3 uevents would sometimes be
eaten because of device node collisions that caused the ruleset to fail.
Let's add an (ugly) work-around for this, so that we can even work with
such an older udev.
On current kernels (5.8 for example) under some conditions I don't fully
grok it might happen that a detached loopback block device still has
partition block devices around. Accessing these partition block devices
results in EIO errors (that also fill up dmesg). These devices cannot be
claned up with LOOP_CLR_FD (since the main device already is officially
detached), nor with LOOP_CTL_DELETE (returns EBUSY as long as the
partitions still exist). This is a kernel bug. But it appears to apply
to all recent kernels. I cannot really pin down what triggers this,
suffice to say our heavy-duty test can trigger it.
Either way, let's do something about it: when we notice this state we'll
attach an empty file to it, which is guaranteed to have to part table.
This makes the partitions go away. After closing/reoping the device we
hence are good to go again. ugly workaround, but I think OK enough to
use.
The net result is: with this commit, we'll guarantee that by the time we
attach a file to the loopback device we have zero kernel partitions
associated with it. Thus if we then wait for the kernel partitions we
need to appear we should have entirely reliable behaviour even if
loopback devices by the name are heavily recycled and udev events reach
us very late.
Fixes: #16858
Previously, we'd just wait for the first moment where the kernel exposes
the same numbre of partitions as libblkid tells us. After that point we
enumerate kernel partitions and look for matching libblkid partitions.
With this change we'll instead enumerate with libblkid only, and then
wait for each kernel partition to show up with the exact parameters we
expect them to show up. Once that happens we are happy.
Care is taken to use the udev device notification messages only as hint
to recheck what the kernel actually says. That's because we are
otherwise subject to a race: we might see udev events from an earlier
use of a loopback device. After all these devices are heavily recycled.
Under the assumption that we'll get udev events for *at least* all
partitions we care about (but possibly more) we can fix the race
entirely with one more fix coming in a later commit: if we make sure
that a loopback block device has zero kernel partitions when we take
possession of it, it doesn't matter anymore if we get spurious udev
events from a previous use. All we have to do is notice when the devices
we need all popped up.
When we fall back to classic LOOP_SET_FD logic in case LOOP_CONFIGURE
didn't work we issue LOOP_CLR_FD first. But that call turns out to be
potentially async in the kernel: if something else (let's say
udev/blkid) is accessing the device the ioctl just sets the autoclear
flag and exits. Hence quite often the LOOP_SET_FD will subsequently
fail. Let's avoid the trouble, and immediately exit with EBUSY if
LOOP_CONFIGURE fails, and but remember that LOOP_CONFIGURE is not
available so that on the next iteration we go directly for LOOP_SET_FD
instead.
v246 is long released. Hence the new scheme should be named v247.
(Interesting, how we pretty systematically for the last releases changed
the scheme only every second release)
The idea is that we have strvs like list of server names or addresses, where
the majority of strings is rather short, but some are long and there can
potentially be many strings. So formattting them either all on one line or all
in separate lines leads to output that is either hard to read or uses way too
many rows. We want to wrap them, but relying on the pager to do the wrapping is
not nice. Normal text has a lot of redundancy, so when the pager wraps a line
in the middle of a word the read can understand what is going on without any
trouble. But for a high-density zero-redundancy text like an IP address it is
much nicer to wrap between words. This also makes c&p easier.
This adds a variant of TABLE_STRV which is wrapped on output (with line breaks
inserted between different strv entries).
The change table_print() is quite ugly. A second pass is added to re-calculate
column widths. Since column size is now "soft", i.e. it can adjust based on
available columns, we need to two passes:
- first we figure out how much space we want
- in the second pass we figure out what the actual wrapped columns
widths will be.
To avoid unnessary work, the second pass is only done when we actually have
wrappable fields.
A test is added in test-format-table.
Failed to enter shared memory directory multipath: Permission denied
→
Failed to enter shared memory directory /dev/shm/multipath: Permission denied
When looking at nested directories, we will print only the final two elements
of the path. That is still more useful than just the last component of the
path. To print the full path, we'd have to allocate the string, and since the
error occurs so very rarely, I think the current best-effort approach is
enough.
By making them unsigned comparing them with other sizes is less likely
to trigger compiler warnings regarding signed/unsigned comparisons.
After all sizes (i.e. size_t) are generally assumed to be unsigned, so
these should be too.
Prompted-by: https://github.com/systemd/systemd/pull/17345#issuecomment-709402332
If the first boot was aborted, /etc/machine-id might read as
"uninitialized" in some cases. Add a separate case for this
instead of printing a confusing error message.
p itself is never null. Because of this, we would always
call sd_notify() in cleanup, even though the intention was to only
call it if notify_start() was executed.
I can't think of any real vulnerability about this, but it still feels
better to check a variable with "secure" in its name with
secure_getenv() rather than plain getenv().
Paranoia FTW!
We would return ENOENT, which is extremely confusing. Strace is not helpful because
no *file* is actually missing. So let's add some logs at debug level and also use
a custom return code. Let all user-facing utilities print a custom error message
in that case.
The variable is renamed to SYSTEMD_PAGERSECURE (because it's not just about
less now), and we automatically enable secure mode in certain cases, but not
otherwise.
This approach is more nuanced, but should provide a better experience for
users:
- Previusly we would set LESSSECURE=1 and trust the pager to make use of
it. But this has an effect only on less. We need to not start pagers which
are insecure when in secure mode. In particular more is like that and is a
very popular pager.
- We don't enable secure mode always, which means that those other pagers can
reasonably used.
- We do the right thing by default, but the user has ultimate control by
setting SYSTEMD_PAGERSECURE.
Fixes#5666.
v2:
- also check $PKEXEC_UID
v3:
- use 'sd_pid_get_owner_uid() != geteuid()' as the condition
While a server is in the VARLINK_PENDING_METHOD or VARLINK_PENDING_METHOD_MORE
states and its write end is disconnected and it gets a POLLHUP, we
should disconnect since it can't write anymore.
In the case of systemd-oomd disconnecting while pid1 was pending-more, this
condition left pid1 in a state where it started throttling from
continually getting POLLHUP.
It's an auxiliary function to the UnitInfo structures, and very generic.
Let's hence move it over to the other code operating with UnitInfo, even
if it's not used by code outside of systemctl (yet).
Some extra safety when invoked via "sudo". With this we address a
genuine design flaw of sudo, and we shouldn't need to deal with this.
But it's still a good idea to disable this surface given how exotic it
is.
Prompted by #5666
If we set O_NONBLOCK on stdin/stdout directly this means the flag is
left set when we abort abnormally, as we don't get the chance to reset
it again on exit. This might confuse progrms invoking us. Moreover, if
programs invoking us continue to write to the stdout passed to us, they
might be confused by non-blocking mode too.
Hence, let's avoid this if possible: let's reopen stdin/stdout and set
O_NONBLOCK only on the reopend fds, leaving the original fds as they
are.
Prompted-by: https://github.com/systemd/systemd/pull/17070#issuecomment-702304802
Also, even if login.defs are not present, don't start allocating at 1, but at
SYSTEM_UID_MIN.
Fixes#9769.
The test is adjusted. Actually, it was busted before, because sysusers would
never use SYSTEM_GID_MIN, so if SYSTEM_GID_MIN was different than
SYSTEM_UID_MIN, the tests would fail. On all "normal" systems the two are
equal, so we didn't notice. Since sysusers now always uses the minimum of the
two, we only need to substitute one value.
We don't (and shouldn't I think) look at them when determining the type of the
user, but they should be used during user/group allocation. (For example, an
admin may specify SYS_UID_MIN==200 to allow statically numbered users that are
shared with other systems in the range 1–199.)
It makes little sense to make the boundary between systemd and user guids
configurable. Nevertheless, a completely fixed compile-time define is not
enough in two scenarios:
- the systemd_uid_max boundary has moved over time. The default used to be
500 for a long time. Systems which are upgraded over time might have users
in the wrong range, but changing existing systems is complicated and
expensive (offline disks, backups, remote systems, read-only media, etc.)
- systems are used in a heterogenous enviornment, where some vendors pick
one value and others another.
So let's make this boundary overridable using /etc/login.defs.
Fixes#3855, #10184.
This is like membarrier() I guess and basically just exposes CPU
functionality via kernel syscall on some archs. Let's whitelist it for
everyone.
Fixes: #17197
After https://github.com/systemd/systemd/pull/16981 only the presence of crypt_gensalt_ra
is checked, but there are cases where that function is available but crypt_preferred_method
is not, and they are used in the same ifdef.
Add a check for the latter as well.
Fixes#17035. We use "," as the separator between arguments in fstab and crypttab
options field, but the kernel started using "," within arguments. Users will need
to escape those nested commas.
If the whole call is simple and we don't need to look at the return value
apart from the conditional, let's use a form without assignment of the return
value. When the function call is more complicated, it still makes sense to
use a temporary variable.
Let's make umount_verbose() more like mount_verbose_xyz(), i.e. take log
level and flags param. In particular the latter matters, since we
typically don't actually want to follow symlinks when unmounting.
Let's suppress the secondary arch data, since we never ever want to
mount it if we found the primary arch.
Previously we only suppressed in the Verity case, but there's little
reason to entertain the idea of a secondary arch in non-Verity
environments either, we are not going to use them, and should not do
decryption or anything like that.
Uppercase first char of log message, and indicate correct program name.
Reindent comment table at one place.
Use correct, specific, enum type at one more place.
Just some refactoring: let's place the various verity related parameters
in a common structure, and pass that around instead of the individual
parameters.
Also, let's load the PKCS#7 signature data when finding metadata
right-away, instead of delaying this until we need it. In all cases we
call this there's not much time difference between the metdata finding
and the loading, hence this simplifies things and makes sure root hash
data and its signature is now always acquired together.
With new directive SystemCallLog= it's possible to list system calls to be
logged. This can be used for auditing or temporarily when constructing system
call filters.
---
v5: drop intermediary, update HASHMAP_FOREACH_KEY() use
v4: skip useless debug messages, actually parse directive
v3: don't declare unused variables with old libseccomp
v2: fix build without seccomp or old libseccomp