Some kdbus_flag and memfd related parts are left behind, because they
are entangled with the "legacy" dbus support.
test-bus-benchmark is switched to "manual". It was already broken before
(in the non-kdbus mode) but apparently nobody noticed. Hopefully it can
be fixed later.
This moves pretty much all uses of getpid() over to getpid_raw(). I
didn't specifically check whether the optimization is worth it for each
replacement, but in order to keep things simple and systematic I
switched over everything at once.
This introduces {State,Cache,Log,Configuration}Directory= those are
similar to RuntimeDirectory=. They create the directories under
/var/lib, /var/cache/, /var/log, or /etc, respectively, with the mode
specified in {State,Cache,Log,Configuration}DirectoryMode=.
This also fixes#6391.
This also alters the documentation to recommend memfds rather than /run
for serializing state across reboots. That's because /run doesn't
actually have the same lifecycle as the fd store, as it is cleared out
on restarts.
Fixes: #5606
'n_fds' field in the ExecParameters structure was counting the total number of
file descriptors to be passed to a unit.
This counter also includes the number of passed socket fds which is counted by
'n_socket_fds' already.
This patch removes that redundancy by replacing 'n_fds' with
'n_storage_fds'. The new field only counts the fds passed via the storage store
mechanism. That way each fd is counted at one place only.
Subsequently the patch makes sure to fix code that used 'n_fds' and also wanted
to iterate through all of them by explicitly adding 'n_socket_fds' + 'n_storage_fds'.
Suggested by Lennart.
Make sure to only apply the O_NONBLOCK flag to the fds passed via socket
activation.
Previously the flag was also applied to the fds which came from the fd store
but this was incorrect since services, after being restarted, expect that these
passed fds have their flags unchanged and can be reused as before.
The documentation was a bit unclear about this so clarify it.
log_struct takes multiple format strings, each one followed by arguments.
The _printf_ annotation is not sufficiently flexible to express this,
but we can still annotate the first format string, though not its
arguments (because their number is unknown).
With the annotation, the places which specified the message id or similar
as the first pattern cause a warning from -Wformat-nonliteral. This can
be trivially fixed by putting the MESSAGE= first.
This change will help find issues where a non-literal is erroneously used
as the pattern.
Stored information will help us to resume execution after the
daemon-reload.
This commit implements following scheme,
* On serialization:
- we count rank of the currently executing command
- we store command type, its rank and command line arguments
* On deserialization:
- configuration is parsed and loaded
- we deserialize stored data, command type, rank and arguments
- we look at the given rank in the list and if command there has same
arguments then we restore execution at that point
- otherwise we search respective command list and we look for command
that has the same arguments
- if both methods fail we do not do not resume execution at all
To better illustrate how does above scheme works, please consider
following cases (<<< denotes position where we resume execution after reload)
; Original unit file
[Service]
ExecStart=/bin/true <<<
ExecStart=/bin/false
; Swapped commands
; Second command is not going to be executed
[Service]
ExecStart=/bin/false
ExecStart=/bin/true <<<
; Commands added before
; Same commands are problematic and execution could be restarted at wrong place
[Service]
ExecStart=/bin/foo
ExecStart=/bin/bar
ExecStart=/bin/true <<<
ExecStart=/bin/false
; Commands added after
; Same commands are not an issue in this case
[Service]
ExecStart=/bin/true <<<
ExecStart=/bin/false
ExecStart=/bin/foo
ExecStart=/bin/bar
; New commands interleaved with old commands
; Some new commands will be executed while others won't
ExecStart=/bin/foo
ExecStart=/bin/true <<<
ExecStart=/bin/bar
ExecStart=/bin/false
As you can see, above scheme has some drawbacks. However, in most
cases (we assume that in most common case unit file command list is not
changed while some other command is running for the same unit) it
should cause that systemd does the right thing, which is restoring
execution exactly at the point we were before daemon-reload.
Fixes#518
We use our cgroup APIs in various contexts, including from our libraries
sd-login, sd-bus. As we don#t control those environments we can't rely
that the unified cgroup setup logic succeeds, and hence really shouldn't
assert on it.
This more or less reverts 415fc41cea.
cg_[all_]unified() test whether a specific controller or all controllers are on
the unified hierarchy. While what's being asked is a simple binary question,
the callers must assume that the functions may fail any time, which
unnecessarily complicates their usages. This complication is unnecessary.
Internally, the test result is cached anyway and there are only a few places
where the test actually needs to be performed.
This patch simplifies cg_[all_]unified().
* cg_[all_]unified() are updated to return bool. If the result can't be
decided, assertion failure is triggered. Error handlings from their callers
are dropped.
* cg_unified_flush() is updated to calculate the new result synchrnously and
return whether it succeeded or not. Places which need to flush the test
result are updated to test for failure. This ensures that all the following
cg_[all_]unified() tests succeed.
* Places which expected possible cg_[all_]unified() failures are updated to
call and test cg_unified_flush() before calling cg_[all_]unified(). This
includes functions used while setting up mounts during boot and
manager_setup_cgroup().
Accept AF_VSOCK listen addresses in socket unit files. Both guest and
host can now take advantage of socket activation.
The QEMU guest agent has recently been modified to support socket
activation and can run over AF_VSOCK with this patch.
sockaddr_port() either returns a >= 0 port number or a negative errno.
This works for AF_INET and AF_INET6 because port ranges are only 16-bit.
In AF_VSOCK ports are 32-bit so an int cannot represent all port number
and negative errnos. Separate the port and the return code.
This patch ensures that each system service gets its own session kernel keyring
automatically, and implicitly. Without this a keyring is allocated for it
on-demand, but is then linked with the user's kernel keyring, which is OK
behaviour for logged in users, but not so much for system services.
With this change each service gets a session keyring that is specific to the
service and ceases to exist when the service is shut down. The session keyring
is not linked up with the user keyring and keys hence only search within the
session boundaries by default.
(This is useful in a later commit to store per-service material in the keyring,
for example the invocation ID)
(With input from David Howells)
This monopolizes unit file specifier expansion in load-fragment.c, and removes
it from socket.c + service.c. This way expansion becomes an operation done exclusively at time of loading unit files.
Previously specifiers were resolved for all settings during loading of unit
files with the exception of ExecStart= and friends which were resolved in
socket.c and service.c. With this change the latter is also moved to the
loading of unit files.
Fixes: #3061
We stay in the SERVICE_START while no READY=1 notification message has
been received. When we are in the SERVICE_START_POST state, we have
already received a ready notification. Hence we should not fail when the
cgroup becomes empty in that state.
Before this commit, when the main process of a Type=notify service exits the
service would enter a running state without passing through the startup post
state. This meant ExecStartPost= from being executed and allowed follow-up
units to start too early (before the ready notification).
Additionally, when RemainAfterExit=yes is used on a Type=notify service, the
exit status of the main process would be disregarded.
After this commit, an unsuccessful exit of the main process of a Type=notify
service puts the unit in a failed state. A successful exit is inconsequential
in case RemainAfterExit=yes. Otherwise, when no ready notification has been
received, the unit is put in a failed state because it has never been active.
When all processes in the cgroup of a Type=notify service are gone and no ready
notification has been received yet, the unit is also put in a failed state.
Introduce a SERVICE_FAILURE_PROTOCOL error type for when a service does
not follow the protocol.
This error type is used when a pid file is expected, but not delivered.
It's rather hard to parse the confirmation messages (enabled with
systemd.confirm_spawn=true) amongst the status messages and the kernel
ones (if enabled).
This patch gives the possibility to the user to redirect the confirmation
message to a different virtual console, either by giving its name or its path,
so those messages are separated from the other ones and easier to read.
We don't have plural in the name of any other -util files and this
inconsistency trips me up every time I try to type this file name
from memory. "formats-util" is even hard to pronounce.
Since service_add_fd_store() already does the check, remove the redundant check
from service_add_fd_store_set().
Also, print a warning when repopulating FDStore after daemon-reexec and we hit
the limit. This is a user visible issue, so we should not discard fds silently.
(Note that service_deserialize_item is impacted by the return value from
service_add_fd_store(), but we rely on the general error message, so the caller
does not need to be modified, and does not show up in the diff.)
We would close all the stored fds in service_release_resources(), which of
course broke the whole concept of storing fds over service restart.
Fixes#4408.
It's a common pattern, so add a helper for it. A macro is necessary
because a function that takes a pointer to a pointer would be type specific,
similarly to cleanup functions. Seems better to use a macro.
SIGTERM should be considered a clean exit code for daemons (i.e. long-running
processes, as a daemon without SIGTERM handler may be shut down without issues
via SIGTERM still) while it should not be considered a clean exit code for
commands (i.e. short-running processes).
Let's add two different clean checking modes for this, and use the right one at
the appropriate places.
Fixes: #4275
Let's get rid of is_clean_exit_lsb(), let's move the logic for the special
handling of the two LSB exit codes into the sysv-generator by writing out
appropriate SuccessExitStatus= lines if the LSB header exists. This is not only
semantically more correct, bug also fixes a bug as the code in service.c that
chose between is_clean_exit_lsb() and is_clean_exit() based this check on
whether a native unit files was available for the unit. However, that check was
bogus since a long time, since the SysV generator was introduced and native
SysV script support was removed from PID 1, as in that case a unit file always
existed.
This adds a new invocation ID concept to the service manager. The invocation ID
identifies each runtime cycle of a unit uniquely. A new randomized 128bit ID is
generated each time a unit moves from and inactive to an activating or active
state.
The primary usecase for this concept is to connect the runtime data PID 1
maintains about a service with the offline data the journal stores about it.
Previously we'd use the unit name plus start/stop times, which however is
highly racy since the journal will generally process log data after the service
already ended.
The "invocation ID" kinda matches the "boot ID" concept of the Linux kernel,
except that it applies to an individual unit instead of the whole system.
The invocation ID is passed to the activated processes as environment variable.
It is additionally stored as extended attribute on the cgroup of the unit. The
latter is used by journald to automatically retrieve it for each log logged
message and attach it to the log entry. The environment variable is very easily
accessible, even for unprivileged services. OTOH the extended attribute is only
accessible to privileged processes (this is because cgroupfs only supports the
"trusted." xattr namespace, not "user."). The environment variable may be
altered by services, the extended attribute may not be, hence is the better
choice for the journal.
Note that reading the invocation ID off the extended attribute from journald is
racy, similar to the way reading the unit name for a logging process is.
This patch adds APIs to read the invocation ID to sd-id128:
sd_id128_get_invocation() may be used in a similar fashion to
sd_id128_get_boot().
PID1's own logging is updated to always include the invocation ID when it logs
information about a unit.
A new bus call GetUnitByInvocationID() is added that allows retrieving a bus
path to a unit by its invocation ID. The bus path is built using the invocation
ID, thus providing a path for referring to a unit that is valid only for the
current runtime cycleof it.
Outlook for the future: should the kernel eventually allow passing of cgroup
information along AF_UNIX/SOCK_DGRAM messages via a unique cgroup id, then we
can alter the invocation ID to be generated as hash from that rather than
entirely randomly. This way we can derive the invocation race-freely from the
messages.