If a memory error occurred, we would still go through the path which sets the
error on ferror(). It is unlikely that ferror() returns true, but it's seems
cleaner to just propagate the error we already have.
The handling of fgets() returning NULL is also simplified: according to the man
page, it returns NULL only on EOF or error. So if feof() returns true, I don't
think we should call ferror() again.
While at it, let's set errno to 0 and check that it is set before returning it
as an error. The man pages for fgets() and feof() do not say anything about
setting errno.
Here the behaviour is nominally changed, because we will decrease the
counter on error. But the only caller quits the program if error occurs,
so this makes no practical difference.
Currently they aren't covered and it probably isn't worth adding another
kind of timestamp just for this, hence simply include it in the regular
generator timestamps.
Let's do so already when we are about to complete startup/reload, so
that manager_catchup() is run in a context where MANAGER_IS_RUNNING()
returns true, as the intention is.
Fixes: #9518
Both functions do partly the same, let's make sure they do it in the
same order, and that we don't miss some calls.
This makes a number of changes:
1. Moves exec_runtime_vacuum() two calls down in manager_startup(). This
should not have any effect but makes manager_startup() more like
manager_reload().
2. Calls manager_recheck_journal(), manager_recheck_dbus(),
manager_enqueue_sync_bus_names() in manager_startup() too. This is a
good idea since during reeexec we pass through manager_startup() and
hence can't assume dbus and journald weren't up yet, hence let's
check if they are ready to be connected to.
3. Include manager_enumerate_perpetual() in manager_reload(), too. This
is not strictly necessary, since these units are included in the
serialization anyway, but it's still a nice thing, in particular as
theoretically the deserialization could fail.
let's clean up error handling and logging in manager_reload() a bit.
Specifically: make sure we log about every error we might encounter at
least and at most once.
When we encounter an error before the "point of no return" then log at
LOG_ERR about it and propagate it. Otherwise, eat it up, but warn about
it and proceed, it's the best we can do.
If manager_serialize() fails in the middle (which it hopefully doesn't)
make sure to fix up m->n_reloading correctly again so that we don't
leave it > 0 when it really shouldn't be.
Let's make them typesafe, and let's add a nice macro helper for checking
if we are in a test run, which should make testing for this much easier
to read for most cases.
Instead of blacklisting when not to trim the cgroup tree, let's instead
whitelist when to do it, as an excercise of being careful when being
destructive.
This should not change behaviour with exception that during switch roots
we now won't attempt to trim the cgroup tree anymore. Which is more
correct behaviour after all we serialize/deserialize during the
transition and should be needlessly destructive.
"ExitCode" is a bit of a misnomer in two ways: it suggests this was
about the "exit code" concept that exit()/waitid() deal with, but really
isn't. Moreover, it's not event just about exiting either, but more
often about reloading/reexecing or rebooting. Let's hence pick a new
name for this that is a bit more correct.
I initially thought about naming this the "state", but that'd be a
misnomer too, as the value really encodes a "goal" more than a current
state. Also we already have the externally visible ManagerState.
No actual changes in behaviour, just the rename.
Cgroup v2 provides the eBPF-based device controller, which isn't currently
supported by systemd. This commit aims to provide such support.
There are no user-visible changes, just the device policy and whitelist
start working if cgroup v2 is used.
Bpf programs are charged against memlock ulimit, and the default value
can be too tight on machines with many cgroups and attached bpf programs.
Let's bump it to 64Mb.
Until a core dump handler is installed by systemd-sysctl, the generation of
core dump for services is turned OFF which can make the debugging of the early
boot process harder especially since there's no easy way to restore the core
dump generation.
This patch introduces a new kernel command line option which specifies an
absolute path where the kernel should write the core dump file when an early
process crashes.
This will take effect until systemd-coredump (or any other handlers) takes
over.
This shouldn't change control flow, with one exception: we won't send
notifications for boot progress to plymouth anymore during reload, which
is something we really shouldn't.
Allows configuring the watchdog signal (with a default of SIGABRT).
This allows an alternative to SIGABRT when coredumps are not desirable.
Appropriate references to SIGABRT or aborting were renamed to reflect
more liberal watchdog signals.
Closes#8658
Our logs are full of:
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call oldstat() / -10037, ignoring: Numerical argument out of domain
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call get_thread_area() / -10076, ignoring: Numerical argument out of domain
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call set_thread_area() / -10079, ignoring: Numerical argument out of domain
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call oldfstat() / -10034, ignoring: Numerical argument out of domain
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call oldolduname() / -10036, ignoring: Numerical argument out of domain
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call oldlstat() / -10035, ignoring: Numerical argument out of domain
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call waitpid() / -10073, ignoring: Numerical argument out of domain
...
This is pointless and makes debug logs hard to read. Let's keep the logs
in test code, but disable it in nspawn and pid1. This is done through a function
parameter because those functions operate recursively and it's not possible to
make the caller to log meaningfully.
There should be no functional change, except the skipped debug logs.
Since the commit torvalds/linux@fdb5c4531c
the syscall BPF_PROG_ATTACH return EBADF when CONFIG_CGROUP_BPF is
turned off and as result the bpf_firewall_supported() returns the
incorrect value.
This commmit replaces the syscall BPF_PROG_ATTACH with BPF_PROG_DETACH
which is still work as expected.
Resolvesopenbmc/linux#159
See also systemd/systemd#7054
Signed-off-by: Alexander Filippov <a.filippov@yadro.com>
Without this, log shows meaningless error message 'No anode', e.g.,
===
Failed to unshare the mount namespace: Operation not permitted
foo.service: Failed to set up mount namespacing: No anode
foo.service: Failed at step NAMESPACE spawning /usr/bin/test: No anode
===
Follow-up for 1beab8b0d0.
When loading units, sometimes we'd first encounter a unit from .wants or
.requires directory. A typical case would be when multi-user.target.wants/
contains a symlink to some unit. We would prepare to load this unit using
/etc/systemd/system/multi-user.target.wants/foo.service as the fragment
path. This is always wrong. Instead, let's use NULL as the path and let
manager_load_unit() figure out the path on its own.
Fixes#9921.
path=0x5625ed9b01a0 "/usr/lib/systemd/system/local-fs.target.wants/systemd-remount-fs.service", e=0x0,
_ret=0x7ffe64645000) at ../src/core/manager.c:1887
name=0x5625ed9b01ce "systemd-remount-fs.service",
path=0x5625ed9b01a0 "/usr/lib/systemd/system/local-fs.target.wants/systemd-remount-fs.service", e=0x0,
_ret=0x7ffe64645000) at ../src/core/manager.c:1961
name=0x5625ed9b01ce "systemd-remount-fs.service",
path=0x5625ed9b01a0 "/usr/lib/systemd/system/local-fs.target.wants/systemd-remount-fs.service",
add_reference=true, mask=UNIT_DEPENDENCY_FILE) at ../src/core/unit.c:2946
dir_suffix=0x5625ebb179ed ".wants") at ../src/core/load-dropin.c:95
path=0x0, e=0x0, _ret=0x7ffe646452c0) at ../src/core/manager.c:1965
name=0x5625ebb186f8 "local-fs.target", path=0x0, add_reference=true,
mask=UNIT_DEPENDENCY_MOUNTINFO_IMPLICIT) at ../src/core/unit.c:2946
where=0x5625ed9b3cc0 "/tmp", options=0x5625ed947110 "rw,nosuid,nodev,seclabel",
fstype=0x5625ed95be90 "tmpfs", flags=0x7ffe64645395) at ../src/core/mount.c:1439
where=0x5625ed9b3cc0 "/tmp", options=0x5625ed947110 "rw,nosuid,nodev,seclabel",
fstype=0x5625ed95be90 "tmpfs", set_flags=false) at ../src/core/mount.c:1567
at ../src/core/mount.c:1635
ret_retval=0x7ffe64645660, ret_shutdown_verb=0x7ffe646456c0, ret_fds=0x7ffe646456d8,
ret_switch_root_dir=0x7ffe646456b0, ret_switch_root_init=0x7ffe646456b8,
ret_error_message=0x7ffe646456c8) at ../src/core/main.c:1669
The MountEntry's added for EMPTY_DIR work very similarly to the TMPFS ones.
In both cases, .has_prefix is false. In fact, .has_prefix is false in
*all* the MountEntry's we add except for the access mounts (READONLY etc).
But EMPTY_DIR stuck out by explicitly setting .has_prefix = false.
Let's remove that.
... when no mount options are passed.
Change the code, to avoid the following failure in the newly added tests:
exec-temporaryfilesystem-rw.service: Executing: /usr/bin/sh -x -c
'[ "$(stat -c %a /var)" == 755 ]'
++ stat -c %a /var
+ '[' 1777 == 755 ']'
Received SIGCHLD from PID 30364 (sh).
Child 30364 (sh) died (code=exited, status=1/FAILURE)
(And I spotted an opportunity to use TAKE_PTR() at the end).
We can't remount the underlying superblocks, if we are inside a user
namespace and running Linux <= 4.17. We can only change the per-mount
flags (MS_REMOUNT | MS_BIND).
This type of mount() call can only change the per-mount flags, so we
don't have to worry about passing the right string options now.
Fixes#9914 ("Since 1beab8b was merged, systemd has been failing to start
systemd-resolved inside unprivileged containers" ... "Failed to re-mount
'/run/systemd/unit-root/dev' read-only: Operation not permitted").
> It's basically my fault :-). I pointed out we could remount read-only
> without MS_BIND when reviewing the PR that added TemporaryFilesystem=,
> and poettering suggested to change PrivateDevices= at the same time.
> I think it's safe to change back, and I don't expect anyone will notice
> a difference in behaviour.
>
> It just surprised me to realize that
> `TemporaryFilesystem=/tmp:size=10M,ro,nosuid` would not apply `ro` to the
> superblock (underlying filesystem), like mount -osize=10M,ro,nosuid does.
> Maybe a comment could note the kernel version (v4.18), that lets you
> remount without MS_BIND inside a user namespace.
This makes the code longer and I guess this function is still ugly, sorry.
One obstacle to cleaning it up is the interaction between
`PrivateDevices=yes` and `ReadOnlyPaths=/dev`. I've added a test for the
existing behaviour, which I think is now the correct behaviour.
This makes two changes to the namespacing code:
1. We'll only gracefully skip service namespacing on access failure if
exclusively sandboxing options where selected, and not mount-related
options that result in a very different view of the world. For example,
ignoring RootDirectory=, RootImage= or Bind= is really probablematic,
but ReadOnlyPaths= is just a weaker sandbox.
2. The namespacing code will now return a clearly recognizable error
code when it cannot enforce its namespacing, so that we cannot
confuse EPERM errors from mount() with those from unshare(). Only the
errors from the first unshare() are now taken as hint to gracefully
disable namespacing.
Fixes: #9844#9835
The fstab entry may contain comment/application-specific options, like
for example x-systemd.automount or x-initrd.mount.
With the recent switch to libmount, the mount options during remount are
now gathered via mnt_fs_get_options(), which returns the merged fstab
options with the effective options in mountinfo.
Unfortunately if one of these application-specific options are set in
fstab, the remount will fail with -EINVAL.
In systemd 238:
Remounting '/test-x-initrd-mount' read-only in with options
'errors=continue,user_xattr,acl'.
In systemd 239:
Remounting '/test-x-initrd-mount' read-only in with options
'errors=continue,user_xattr,acl,x-initrd.mount'.
Failed to remount '/test-x-initrd-mount' read-only: Invalid argument
So instead of using mnt_fs_get_options(), we're now using both
mnt_fs_get_fs_options() and mnt_fs_get_vfs_options() and merging the
results together so we don't get any non-relevant options from fstab.
Signed-off-by: aszlig <aszlig@nix.build>
Let's fold get_user_creds_clean() into get_user_creds(), and introduce a
flags argument for it to select "clean" behaviour. This flags parameter
also learns to other new flags:
- USER_CREDS_SYNTHESIZE_FALLBACK: in this mode the user records for
root/nobody are only synthesized as fallback. Normally, the synthesized
records take precedence over what is in the user database. With this
flag set this is reversed, and the user database takes precedence, and
the synthesized records are only used if they are missing there. This
flag should be set in cases where doing NSS is deemed safe, and where
there's interest in knowing the correct shell, for example if the
admin changed root's shell to zsh or suchlike.
- USER_CREDS_ALLOW_MISSING: if set, and a UID/GID is specified by
numeric value, and there's no user/group record for it accept it
anyway. This allows us to fix#9767
This then also ports all users to set the most appropriate flags.
Fixes: #9767
[zj: remove one isempty() call]
On the host these symlinks are created by udev, and we consider them API
and make use of them ourselves at various places. Hence when running a
private /dev, also create these symlinks so that lookups by major/minor
work in such an environment, too.
When stdin/stdout/stderr is initialized from an fd, let's read the tty
name of it if we can, and pass that to PAM.
This makes sure that "machinectl shell" sessions have proper TTY fields
initialized that "loginctl" then shows.
This is a bit like the info link in most of GNU's --help texts, but we
don't do info but man pages, and we make them properly clickable on
terminal supporting that, because awesome.
I think it's generally advisable to link up our (brief) --help texts and
our (more comprehensive) man pages a bit, so this should be an easy and
straight-forward way to do it.
Previously, we'd act immediately on StopWhenUnneeded= when a unit state
changes. With this rework we'll maintain a queue instead: whenever
there's the chance that StopWhenUneeded= might have an effect we enqueue
the unit, and process it later when we have nothing better to do.
This should make the implementation a bit more reliable, as the unit notify event
cannot immediately enqueue tons of side-effect jobs that might
contradict each other, but we do so only in a strictly ordered fashion,
from the main event loop.
This slightly changes the check when to consider a unit "unneeded".
Previously, we'd assume that a unit in "deactivating" state could also
be cleaned up. With this new logic we'll only consider units unneeded
that are fully up and have no job queued. This means that whenever
there's something pending for a unit we won't clean it up.
Looked for definitions of functions using the *_compare_func() suffix.
Tested:
- Unit tests passed (ninja -C build/ test)
- Installed this build and booted with it.
RootImage= may require the following settings
```
DeviceAllow=/dev/loop-control rw
DeviceAllow=block-loop rwm
DeviceAllow=block-blkext rwm
```
This adds the following settings implicitly when RootImage= is
specified.
Fixes#9737.
These take a struct iovec to send data together with the passed FD.
The receive function returns the FD through an output argument. In case data is
received, but no FD is passed, the receive function will set the output
argument to -1 explicitly.
Update code in dynamic-user to use the new helpers.
When DynamicUser=yes and static User= are set, and the user has
different uid and gid, then as the storage socket for the dynamic
user does not contains gid, we need to obtain gid.
Follow-up for 9ec655cbbd.
Fixes#9702.
Users are often surprised that "systemd-run" command lines like
"systemd-run -p User=idontexist /bin/true" will return successfully,
even though the logs show that the process couldn't be invoked, as the
user "idontexist" doesn't exist. This is because Type=simple will only
wait until fork() succeeded before returning start-up success.
This patch adds a new service type Type=exec, which is very similar to
Type=simple, but waits until the child process completed the execve()
before returning success. It uses a pipe that has O_CLOEXEC set for this
logic, so that the kernel automatically sends POLLHUP on it when the
execve() succeeded but leaves the pipe open if not. This means PID 1
waits exactly until the execve() succeeded in the child, and not longer
and not shorter, which is the desired functionality.
Making use of this new functionality, the command line
"systemd-run -p User=idontexist -p Type=exec /bin/true" will now fail,
as expected.
When process fd lists to pass to activated programs we always place the
socket activation fds first, and the storage fds last. Irritatingly in
almost all calls the "n_storage_fds" parameter (i.e. the number of
storage fds to pass) came first so far, and the "n_socket_fds" parameter
second. Let's clean this up, and specify the number of fds in the order
the fds themselves are passed.
(Also, let's fix one more case where "unsigned" was used to size an
array, while we should use "size_t" instead.)
We so far had various placed we'd parse percentages with
parse_percent(). Let's make them use parse_permille() instead, which is
downward compatible (as it also parses percent values), and increases
the granularity a bit. Given that on the wire we usually normalize
relative specifications to something like UINT32_MAX anyway changing
from base-100 to base-1000 calculations can be done easily without
breaking compat.
This commit doesn't document this change in the man pages. While
allowing more precise specifcations permille is not as commonly
understood as perent I guess, hence let's keep this out of the docs for
now.
Usecase is to allow changing the final kill from SIGKILL to SIGQUIT which
should create a core dump useful for debugging why the service didn't stop
with the SIGTERM
Whenever a unit is started fresh we should flush out any runtime data
from the previous cycle. We are pretty good at that already, but what so
far we missed was the ExecStart=/ExecStop=/… command exit status data.
Let's fix that, and properly flush out that stuff too.
Consider this service:
[Service]
ExecStart=/bin/sleep infinity
ExecStop=/bin/false
When this service is started, then stopped and then started again
"systemctl status" would show the ExecStop= results of the previous run
along with the ExecStart= results of the current one, which is very
confusing. With this patch this is corrected: the data is kept right
until the moment the new service cycle starts, and then flushed out.
Hence "systemctl status" in that case will only show the ExecStart=
data, but no ExecStop= data, like it should be.
This should fix part of the confusion of #9588
We always initialize it from the same field in ExecCommand anyway, hence
there's no point in passing it separately to exec_spawn(), after all we
already pass the ExecCommand structure itself anyway.
No change in behaviour.