FixedRandomDelay=yes will use
`siphash24(sd_id128_get_machine() || MANAGER_IS_SYSTEM(m) || getuid() || u->id)`,
where || is concatenation, instead of a random number to choose a value between
0 and RandomizedDelaySec= as the timer delay.
This essentially sets up a fixed, but seemingly random, offset for each timer
iteration rather than having a random offset recalculated each time it fires.
Closes#10355
Co-author: Anita Zhang <the.anitazha@gmail.com>
This beefs up the READ_FULL_FILE_CONNECT_SOCKET logic of
read_full_file_full() a bit: when used a sender socket name may be
specified. If specified as NULL behaviour is as before: the client
socket name is picked by the kernel. But if specified as non-NULL the
client can pick a socket name to use when connecting. This is useful to
communicate a minimal amount of metainformation from client to server,
outside of the transport payload.
Specifically, these beefs up the service credential logic to pass an
abstract AF_UNIX socket name as client socket name when connecting via
READ_FULL_FILE_CONNECT_SOCKET, that includes the requesting unit name
and the eventual credential name. This allows servers implementing the
trivial credential socket logic to distinguish clients: via a simple
getpeername() it can be determined which unit is requesting a
credential, and which credential specifically.
Example: with this patch in place, in a unit file "waldo.service" a
configuration line like the following:
LoadCredential=foo:/run/quux/creds.sock
will result in a connection to the AF_UNIX socket /run/quux/creds.sock,
originating from an abstract namespace AF_UNIX socket:
@$RANDOM/unit/waldo.service/foo
(The $RANDOM is replaced by some randomized string. This is included in
the socket name order to avoid namespace squatting issues: the abstract
socket namespace is open to unprivileged users after all, and care needs
to be taken not to use guessable names)
The services listening on the /run/quux/creds.sock socket may thus
easily retrieve the name of the unit the credential is requested for
plus the credential name, via a simpler getpeername(), discarding the
random preifx and the /unit/ string.
This logic uses "/" as separator between the fields, since both unit
names and credential names appear in the file system, and thus are
designed to use "/" as outer separators. Given that it's a good safe
choice to use as separators here, too avoid any conflicts.
This is a minimal patch only: the new logic is used only for the unit
file credential logic. For other places where we use
READ_FULL_FILE_CONNECT_SOCKET it is probably a good idea to use this
scheme too, but this should be done carefully in later patches, since
the socket names become API that way, and we should determine the right
amount of info to pass over.
This adds a way to control SO_TIMESTAMP/SO_TIMESTAMPNS socket options
for sockets PID 1 binds to.
This is useful in journald so that we get proper timestamps even for
ingress log messages that are submitted before journald is running.
We recently turned on packet info metadata from PID 1 for these sockets,
but the timestamping info was still missing. Let's correct that.
If processes remain in the unit's cgroup after the final SIGKILL is
sent and the unit has exceeded stop timeout, don't release the unit's
cgroup information. Pid1 will have failed to `rmdir` the cgroup path due
to processes remaining in the cgroup and releasing would leave the cgroup
path on the file system with no tracking for pid1 to clean it up.
Instead, keep the information around until the last process exits and pid1
sends the cgroup empty notification. The service/scope can then prune
the cgroup if the unit is inactive/failed.
This changes the default from putting all units into the root slice to
placing them into the app slice in the user manager. The advantage is
that we get the right behaviour in most cases, and we'll need special
case handling in all other cases anyway.
Note that we have currently defined that applications *should* start
their unit names with app-, so we could also move only these by creating
a drop-in for app-.scope and app-.service.
However, that would not answer the question on how we should manage
session.slice. And we would end up placing anything that does not fit
the system (e.g. anything started by dbus-broker currently) into the
root slice.
The test was failing because it couldn't start the service:
path-modified.service: state = failed; result = exit-code
path-modified.path: state = waiting; result = success
path-modified.service: state = failed; result = exit-code
path-modified.path: state = waiting; result = success
path-modified.service: state = failed; result = exit-code
path-modified.path: state = waiting; result = success
path-modified.service: state = failed; result = exit-code
path-modified.path: state = waiting; result = success
path-modified.service: state = failed; result = exit-code
path-modified.path: state = waiting; result = success
path-modified.service: state = failed; result = exit-code
Failed to connect to system bus: No such file or directory
-.slice: Failed to enable/disable controllers on cgroup /system.slice/kojid.service, ignoring: Permission denied
path-modified.service: Failed to create cgroup /system.slice/kojid.service/path-modified.service: Permission denied
path-modified.service: Failed to attach to cgroup /system.slice/kojid.service/path-modified.service: No such file or directory
path-modified.service: Failed at step CGROUP spawning /bin/true: No such file or directory
path-modified.service: Main process exited, code=exited, status=219/CGROUP
path-modified.service: Failed with result 'exit-code'.
Test timeout when testing path-modified.path
In fact any of the services that we try to start may fail, especially
considering that we're doing some rogue cgroup operations. See
https://github.com/systemd/systemd/pull/16603#issuecomment-679133641.
sync() before committing a transient machine-id to disk. This will
ensure that any filesystem changes made by first-boot units will have
been persisted before the first boot is marked as completed.
Currently, a loss of power after the machine-id was written but before
all units with ConditionFirstBoot=yes ran would lead to the next boot
finding a valid machine-id, thus not being marked first boot and not
re-running these units.
To make the first boot mechanism more robust, instead of writing
/etc/machine-id very early, fill it with a marker value "uninitialized"
and overmount it with a transiently provisioned machine-id. Then, after
the first boots completes (when systemd-machine-id-commit.service runs),
write the real machine-id to disk.
This mechanism is of course only invoked on first boot. If a first boot
is not detected, the machine-id is handled as previously.
Fixes: #4511
When /etc/machine-id contains the string "uninitialized" instead of
a valid machine-id, treat this like the file was missing and mark this
boot as the first (-> units with ConditionFirstBoot=yes will run).
let's add informational logging about each client requested signal
sending. While we are at, let's beef up error handling/log messages in
this case quite a bit: let's log errors both to syslog and report errors
back to client.
Fixes: #17254
--kill-who=main-fail never worked correctly, due to a copy and paste
mistake in ac5e3a505e, where the same item
was listed twice. The mistake was
later noticed, but fixed incorrectly, in
201f0c916d.
Let's list all *-fail types correctly, finally.
And while we are at it, add a nice comment and generate a prettier D-Bus
error about this.
Let's mark the whole /run/host hierarchy as something to ignore by PID 1
for generation of .mount units, i.e. consider it as "extrinsic".
By unifying container mgr supplied resources in one dir it's also easy
to exclude the whole lot from PID1's management inside the container.
This is the right thing to do, since from the payload's PoV these mounts
are just API and not manipulatable as they are established, managed and
owned by the container manager, not the payload.
(While we are it, also add the boot ID mount to the existing list, as
nspawn and other container managers overmount that too, typically, and
it is thus owned by the container manager and not the payload
typically.)
We close fds in two phases, first some and then the some more. When passing
a list of fds to exclude from closing to the closing function, we would
pass some in an array and the rest as separate arguments. For the fds which
should be excluded in both closing phases, let's always create the array
and put the relevant fds there. This has the advantage that if more fds to
exclude in both phases are added later, we don't need to add more positional
arguments.
The list passed to setup_pam() is not changed. I think we could pass more fds
to close there, but I'm leaving that unchanged.
The setting of FD_CLOEXEC on an already open fds is dropped. The fd is opened
in service_allocate_exec_fd() and there is no reason to suspect that it might
have been opened incorrectly. If some rogue code is unsetting our FD_CLOEXEC
bits, then it might flip any fd, no reason to single this one out.
We don't (and shouldn't I think) look at them when determining the type of the
user, but they should be used during user/group allocation. (For example, an
admin may specify SYS_UID_MIN==200 to allow statically numbered users that are
shared with other systems in the range 1–199.)
Fixes#16991fb39af4ce4 replaced `free_arguments()` with
`reset_arguments()`, which frees arg_* variables as before, but also resets all
of them to the default values. `reset_arguments()` was positioned
in such a way that it overrode some arg_* values still in use at shutdown.
To avoid further unintentional resets, I moved `reset_arguments()`
right before the return, when nothing else will be using the arg_* variables.
If a namespace with PrivateTmp=true is constructed we need to restore
the context of the namespaces /tmp directory (i.e.
/tmp/systemd-private-XXXXX/tmp) to the (default) context of /tmp .
Otherwise filetransitions might result in the namespaces tmp directory
having the wrong context.
If the whole call is simple and we don't need to look at the return value
apart from the conditional, let's use a form without assignment of the return
value. When the function call is more complicated, it still makes sense to
use a temporary variable.
Lennart wanted to do this back in
01c33c1eff.
For better or worse, this wasn't done because I thought that turning on MountAPIVFS
is a compat break for RootDirectory and people might be negatively surprised by it.
Without this, search for binaries doesn't work (access_fd() requires /proc).
Let's turn it on, but still allow overriding to "no".
When RootDirectory=/, MountAPIVFS=1 doesn't work. This might be a buglet on its
own, but this patch doesn't change the situation.
(Well, at least the ones where that makes sense. Where it does't make
sense are the ones that re invoked on the root path, which cannot
possibly be a symlink.)
Let's make umount_verbose() more like mount_verbose_xyz(), i.e. take log
level and flags param. In particular the latter matters, since we
typically don't actually want to follow symlinks when unmounting.
This has two advantages:
- we save a bit of IO in early boot because we don't look for executables
which we might never call
- if the executable is in a different place and it was specified as a
non-absolute path, it is OK if it moves to a different place. This should
solve the case paths are different in the initramfs.
Since the executable path is only available quite late, the call to
mac_selinux_get_child_mls_label() which uses the path needs to be moved down
too.
Fixes#16076.
Similar to free_and_replace. I think this should be uppercase to make it
clear that this is a macro. free_and_replace should probably be uppercased
too.
Other tools silently ignore non-executable names found in path. By checking
F_OK, we would could pick non-executable path even though there is an executable
one later.
We had a special test case that the second semicolon would be interpreted
as an executable name. We would then try to find the executable and rely
on ";" not being found to cause ENOEXEC to be returned. I think that's just
crazy. Let's treat the second semicolon as a separator and ignore the
whole thing as we would whitespace.
Just some refactoring: let's place the various verity related parameters
in a common structure, and pass that around instead of the individual
parameters.
Also, let's load the PKCS#7 signature data when finding metadata
right-away, instead of delaying this until we need it. In all cases we
call this there's not much time difference between the metdata finding
and the loading, hence this simplifies things and makes sure root hash
data and its signature is now always acquired together.
With new directive SystemCallLog= it's possible to list system calls to be
logged. This can be used for auditing or temporarily when constructing system
call filters.
---
v5: drop intermediary, update HASHMAP_FOREACH_KEY() use
v4: skip useless debug messages, actually parse directive
v3: don't declare unused variables with old libseccomp
v2: fix build without seccomp or old libseccomp
Define explicit action "kill" for SystemCallErrorNumber=.
In addition to errno code, allow specifying "kill" as action for
SystemCallFilter=.
---
v7: seccomp_parse_errno_or_action() returns -EINVAL if !HAVE_SECCOMP
v6: use streq_ptr(), let errno_to_name() handle bad values, kill processes,
init syscall_errno
v5: actually use seccomp_errno_or_action_to_string(), don't fail bus unit
parsing without seccomp
v4: fix build without seccomp
v3: drop log action
v2: action -> number
Like it's customary in our codebase bus_error_message() internally takes
abs() of the passed error anyway, hence no need to explicitly negate it.
We mostly got this right, but in too many cases we didn't. Fix that.
We already do this for socket and automount units, do it for path units
too: if the triggered service keeps hitting the start limit, then fail
the triggering unit too, so that we don#t busy loop forever.
(Note that this leaves only timer units out in the cold for this kind of
protection, but it shouldn't matter there, as they are naturally
protected against busy loops: they are scheduled by time anyway).
Fixes: #16669
In 4c2ef32767 we enabled propagating
triggered unit state to the triggering unit for service units in more
load states, so that we don't accidentally stop tracking state
correctly.
Do the same for our other triggering unit states: automounts, paths, and
timers.
Also, make this an assertion rather than a simple test. After all it
should never happen that we get called for half-loaded units or units of
the wrong type. The load routines should already have made this
impossible.
We want to be able to use BusName= in services that run during early boot
already, and thus don't synthesize deps on dbus there. Instead add them
when Type=dbus is set, because in that case we actually really need
D-Bus support.
Fixes: #17037
In containers we might lack the privs to up the socket buffers. Let's
not complain so loudly about that. Let's hence downgrade this to debug
logging if it's a permission problem.
(This wasn't an issue before b92f350789
because back then the failures wouldn't be detected at all.)
We generally don't support prefix being != /usr, and this is hardcoded
all over the place. In the systemd.pc file it wasn't so far. Let's
adjust this to match the rest of the codebase.
A variety of sockopts exist both for IPv4 and IPv6 but require a
different pair of sockopt level/option number. Let's add helpers for
these that internally determine the right sockopt to call.
This should shorten code that generically wants to support both ipv4 +
ipv6 and for the first time adds correct support for some cases where we
only called the ipv4 versions, and not the ipv6 options.
The securebit keep-caps retains the capabilities in the permitted set
over an UID change (ambient capabilities are cleared though).
Setting the keep-caps securebit after the uid change and before execve
doesn't make sense as it is cleared during execve and there is no
additional user ID change after this point.
Altough the documentation (man 7 capabilities) is ambigious, keep-caps
is reset during execve although keep-caps-locked is set. After execve
only keep-caps-locked is set and keep-caps is cleared.