Users are often surprised that "systemd-run" command lines like
"systemd-run -p User=idontexist /bin/true" will return successfully,
even though the logs show that the process couldn't be invoked, as the
user "idontexist" doesn't exist. This is because Type=simple will only
wait until fork() succeeded before returning start-up success.
This patch adds a new service type Type=exec, which is very similar to
Type=simple, but waits until the child process completed the execve()
before returning success. It uses a pipe that has O_CLOEXEC set for this
logic, so that the kernel automatically sends POLLHUP on it when the
execve() succeeded but leaves the pipe open if not. This means PID 1
waits exactly until the execve() succeeded in the child, and not longer
and not shorter, which is the desired functionality.
Making use of this new functionality, the command line
"systemd-run -p User=idontexist -p Type=exec /bin/true" will now fail,
as expected.
When process fd lists to pass to activated programs we always place the
socket activation fds first, and the storage fds last. Irritatingly in
almost all calls the "n_storage_fds" parameter (i.e. the number of
storage fds to pass) came first so far, and the "n_socket_fds" parameter
second. Let's clean this up, and specify the number of fds in the order
the fds themselves are passed.
(Also, let's fix one more case where "unsigned" was used to size an
array, while we should use "size_t" instead.)
Whenever a unit is started fresh we should flush out any runtime data
from the previous cycle. We are pretty good at that already, but what so
far we missed was the ExecStart=/ExecStop=/… command exit status data.
Let's fix that, and properly flush out that stuff too.
Consider this service:
[Service]
ExecStart=/bin/sleep infinity
ExecStop=/bin/false
When this service is started, then stopped and then started again
"systemctl status" would show the ExecStop= results of the previous run
along with the ExecStart= results of the current one, which is very
confusing. With this patch this is corrected: the data is kept right
until the moment the new service cycle starts, and then flushed out.
Hence "systemctl status" in that case will only show the ExecStart=
data, but no ExecStop= data, like it should be.
This should fix part of the confusion of #9588
We always initialize it from the same field in ExecCommand anyway, hence
there's no point in passing it separately to exec_spawn(), after all we
already pass the ExecCommand structure itself anyway.
No change in behaviour.
These lines are generally out-of-date, incomplete and unnecessary. With
SPDX and git repository much more accurate and fine grained information
about licensing and authorship is available, hence let's drop the
per-file copyright notice. Of course, removing copyright lines of others
is problematic, hence this commit only removes my own lines and leaves
all others untouched. It might be nicer if sooner or later those could
go away too, making git the only and accurate source of authorship
information.
This part of the copyright blurb stems from the GPL use recommendations:
https://www.gnu.org/licenses/gpl-howto.en.html
The concept appears to originate in times where version control was per
file, instead of per tree, and was a way to glue the files together.
Ultimately, we nowadays don't live in that world anymore, and this
information is entirely useless anyway, as people are very welcome to
copy these files into any projects they like, and they shouldn't have to
change bits that are part of our copyright header for that.
hence, let's just get rid of this old cruft, and shorten our codebase a
bit.
They are not needed, because anything that is non-zero is converted
to true.
C11:
> 6.3.1.2: When any scalar value is converted to _Bool, the result is 0 if the
> value compares equal to 0; otherwise, the result is 1.
https://stackoverflow.com/questions/31551888/casting-int-to-bool-in-c-c
This is mostly fall-out from d1a1f0aaf0,
however some cases are older bugs.
There might be more issues lurking, this was a simple grep for "%m"
across the tree, with all lines removed that mention "errno" at all.
This reworks how systemd tracks processes on cgroupv1 systems where
cgroup notification is not reliable. Previously, whenever we had reason
to believe that new processes showed up or got removed we'd scan the
cgroup of the scope or service unit for new processes, and would tidy up
the list of PIDs previously watched. This scanning is relatively slow,
and does not scale well. With this change behaviour is changed: instead
of scanning for new/removed processes right away we do this work in a
per-unit deferred event loop job. This event source is scheduled at a
very low priority, so that it is executed when we have time but does not
starve other event sources. This has two benefits: this expensive work is
coalesced, if events happen in quick succession, and we won't delay
SIGCHLD handling for too long.
This patch basically replaces all direct invocation of
unit_watch_all_pids() in scope.c and service.c with invocations of the
new unit_enqueue_rewatch_pids() call which just enqueues a request of
watching/tidying up the PID sets (with one exception: in
scope_enter_signal() and service_enter_signal() we'll still do
unit_watch_all_pids() synchronously first, since we really want to know
all processes we are about to kill so that we can track them properly.
Moreover, all direct invocations of unit_tidy_watch_pids() and
unit_synthesize_cgroup_empty_event() are removed too, when the
unit_enqueue_rewatch_pids() call is invoked, as the queued job will run
those operations too.
All of this is done on cgroupsv1 systems only, and is disabled on
cgroupsv2 systems as cgroup-empty notifications are reliable there, and
we do not need SIGCHLD events to track processes there.
Fixes: #9138
Previously, we'd not care about failures that were seen earlier and
remain in "exited" state. This could be triggered if the main process of
a service failed while ExecStartPost= was still running, as in that case
we'd not immediately act on the main process failure because we needed
to wait for ExecStartPost= to finish, before acting on it.
Fixes: #8929
The function is similar to path_kill_slashes() but also removes
initial './', trailing '/.', and '/./' in the path.
When the second argument of path_simplify() is false, then it
behaves as the same as path_kill_slashes(). Hence, this also
replaces path_kill_slashes() with path_simplify().
This adds a flags parameter to unit_notify() which can be used to pass
additional notification information to the function. We the make the old
reload_failure boolean parameter one of these flags, and then add a new
flag that let's unit_notify() if we are configured to restart the
service.
Note that this adjusts behaviour of systemd to match what the docs say.
Fixes: #8398
Commenting out "WatchdogTimeout=3min" in systemd-logind.service causes
NotifyAccess to go from "main" to "none", breaking support for logind
restart. Let's fix that.
Previously we were a bit sloppy with the index and size types of arrays,
we'd regularly use unsigned. While I don't think this ever resulted in
real issues I think we should be more careful there and follow a
stricter regime: unless there's a strong reason not to use size_t for
array sizes and indexes, size_t it should be. Any allocations we do
ultimately will use size_t anyway, and converting forth and back between
unsigned and size_t will always be a source of problems.
Note that on 32bit machines "unsigned" and "size_t" are equivalent, and
on 64bit machines our arrays shouldn't grow that large anyway, and if
they do we have a problem, however that kind of overly large allocation
we have protections for usually, but for overflows we do not have that
so much, hence let's add it.
So yeah, it's a story of the current code being already "good enough",
but I think some extra type hygiene is better.
This patch tries to be comprehensive, but it probably isn't and I missed
a few cases. But I guess we can cover that later as we notice it. Among
smaller fixes, this changes:
1. strv_length()' return type becomes size_t
2. the unit file changes array size becomes size_t
3. DNS answer and query array sizes become size_t
Fixes: https://bugs.freedesktop.org/show_bug.cgi?id=76745
Files which are installed as-is (any .service and other unit files, .conf
files, .policy files, etc), are left as is. My assumption is that SPDX
identifiers are not yet that well known, so it's better to retain the
extended header to avoid any doubt.
I also kept any copyright lines. We can probably remove them, but it'd nice to
obtain explicit acks from all involved authors before doing that.
This macro will read a pointer of any type, return it, and set the
pointer to NULL. This is useful as an explicit concept of passing
ownership of a memory area between pointers.
This takes inspiration from Rust:
https://doc.rust-lang.org/std/option/enum.Option.html#method.take
and was suggested by Alan Jenkins (@sourcejedi).
It drops ~160 lines of code from our codebase, which makes me like it.
Also, I think it clarifies passing of ownership, and thus helps
readability a bit (at least for the initiated who know the new macro)
Before this change all unit types would default to "private" in the
system service manager and "inherit" to in the user service manager.
With this change this is slightly altered: non-service units of the
system service manager are now run with KeyringMode=shared. This appears
to be the more appropriate choice as isolation is not as desirable for
mount tools, which regularly consume key material. After all mounts are
a shared resource themselves as they appear system-wide hence it makes a
lot of sense to share their key material too.
Fixes: #8159
No functional change.
The source unit manages the reference. It allocates the UnitRef structure and
registers it in the target unit, and then the reference must be destroyed
before the source unit is destroyed. Thus, is should be OK to include the
pointer to the source unit, it should be live as long as the reference exists.
v2:
- rename refs to refs_by_target
"check" is unclear: what is true, what is false? Let's rename to "can_gc" and
revert the return value ("positive" values are easier to grok).
v2:
- rename from unit_can_gc to unit_may_gc
Let's simplify things a bit: we so far called both functions every
single time, let's just merge one into the other, so that we have fewer
functions to call.
Currently we allowed delegation for alluntis with cgroup backing
except for slices. Let's make this a bit more strict for now, and only
allow this in service and scope units.
Let's also add a generic accessor unit_cgroup_delegate() for checking
whether a unit has delegation turned on that checks the new bool first.
Also, when doing transient units, let's explcitly refuse turning on
delegation for unit types that don#t support it. This is mostly
cosmetical as we wouldn't act on the delegation request anyway, but
certainly helpful for debugging.
Let's read the PID file after all if there's a potentially unsafe
symlink chain in place. But if we do, then refuse taking the PID if its
outside of the cgroup.
Fixes: #8085
Before this, each ExecRuntime object is owned by a unit. However,
it may be shared with other units which enable JoinsNamespaceOf=.
Thus, by the serialization/deserialization process, its sharing
information, more specifically, reference counter is lost, and
causes issue #7790.
This makes ExecRuntime objects be managed by manager, and changes
the serialization/deserialization process.
Fixes#7790.
Let's add a per-unit boolean that tells us whether our unit is currently
counted or not. This way it's unlikely we get out of sync again and
things are generally more robust.
This also allows us to remove the counting logic specific to service
units (which was in fact mostly a copy from the generic implementation),
in favour of fully generic code.
Replaces: #7824
This call determines whether a specific unit currently needs access to
the console. It's a fancy wrapper around
exec_context_may_touch_console() ultimately, however for service units
we'll explicitly exclude the SERVICE_EXITED state from when we report
true.
The left side of the || expression is conditionalized on SERVICE_START,
but SERVICE_START is blanket listed on the right side anyway, hence we
can drop the left side entirely without any change in behaviour.
Moreover, if main_pid is initialized, it should be watched, hence this
is even the safe and right thing to do.
This code is very similar in scope and service units, let's unify it in
one function. This changes little for service units, but for scope units
makes sure we go through the cgroup queue, which is something we should
do anyway.
Let's send them out only if the main or control processe exited and we
recorded a new exit status that is worth reporting. But if any other
service process died this is nothing to report since we don't expose any
properties about that anyway.
Let's be more restrictive when validating PID files and MAINPID=
messages: don't accept PIDs that make no sense, and if the configuration
source is not trusted, don't accept out-of-cgroup PIDs. A configuratin
source is considered trusted when the PID file is owned by root, or the
message was received from root.
This should lock things down a bit, in case service authors write out
PID files from unprivileged code or use NotifyAccess=all with
unprivileged code. Note that doing so was always problematic, just now
it's a bit less problematic.
When we open the PID file we'll now use the CHASE_SAFE chase_symlinks()
logic, to ensure that we won't follow an unpriviled-owned symlink to a
privileged-owned file thinking this was a valid privileged PID file,
even though it really isn't.
Fixes: #6632
This makes things a bit easier to read I think, and also makes sure we
always use the _unlikely_ wrapper around it, which so far we used
sometimes and other times we didn't. Let's clean that up.
With Type=notify services, EXTEND_TIMEOUT_USEC= messages will delay any startup/
runtime/shutdown timeouts.
A service that hasn't timed out, i.e, start time < TimeStartSec,
runtime < RuntimeMaxSec and stop time < TimeoutStopSec, may by sending
EXTEND_TIMEOUT_USEC=, allow the service to continue beyond the limit for
the execution phase (i.e TimeStartSec, RunTimeMaxSec and TimeoutStopSec).
EXTEND_TIMEOUT_USEC= must continue to be sent (in the same way as
WATCHDOG=1) within the time interval specified to continue to reprevent
the timeout from occuring.
Watchdog timeouts are also extended if a EXTEND_TIMEOUT_USEC is greater
than the remaining time on the watchdog counter.
Fixes#5868.
An auto-restarted unit B may depend on unit A with StopWhenUnneeded=yes.
If A stops before B's restart timeout expires, it'll be started again as part
of B's dependent jobs. However, if stopping takes longer than the timeout, B's
running stop job collides start job which also cancels B's start job. Result is
that neither A or B are active.
Currently, when a service with automatic restarting fails, it transitions
through following states:
1) SERVICE_FAILED or SERVICE_DEAD to indicate the failure,
2) SERVICE_AUTO_RESTART while restart timer is running.
The StopWhenUnneeded= check takes place in service_enter_dead between the two
state mentioned above. We temporarily store the auto restart flag to query it
during the check. Because we don't return control to the main event loop, this
new service unit flag needn't be serialized.
This patch prevents the pathologic situation when the service with Restart=
won't restart automatically. As a side effect it also avoid restarting the
dependency unit with StopWhenUnneeded=yes.
Fixes: #7377
Let's always use the same logic when parsing error numbers, i.e. use
parse_errno() here too, to unify some code, and tighten the checks a
bit.
This also allows clients to pass errors as symbolic names. Probably
nothing we want to advertise too eagerly (since new daemons generating
this on old service managers won't understand), but still pretty
useful I think, in particular in scripting languages and such, where the
numeric error numbers might not be readily available.
Currenly the only way to remove fds from the fdstore is to fully
stop the service, or to somehow trigger POLLERR/POLLHUP on the fd, in
which case systemd will remove the fd automatically.
Let's add another way: a new message that can be sent to remove fds
explicitly, given their name.
Of course, it's not really a valid sd_notify() message if multiple of
these fields are used in one, but let's handle this somewhat gracefully,
by only processing one of them, and ignoring the rest.
Let's optimize things a bit for the non-debug case. No change in
behaviour.
Main reason to do this is not so much the speed benefit though, but
merely to isolate the code from its surroundings more.
Now that we don't kill control processes anymore, let's at least warn
about any processes left-over in the unit cgroup at the moment of
starting the unit.
Let's move the cgroup empty check for all unit types into the generic
unit_check_gc() call, out of the per-unit-type _check_gc() type. This
not only allows us to share some code, but also hooks up mount and
socket units with this kind of check, for free, as it was missing there
previously.
Previously, in the service unit type we ran all control processes in a
special subcgroup /control of the unit's main cgroup. Remove that, and
run the control program in the main cgroup instead.
The concept conflicts with cgroupv2's logic of "no processes in inner
nodes": if a unit has a main daemon process running in the main cgroup,
and a reload control process would be started in the /control subcgroup,
then this would necessarily fail, as the main daemon process would
become an inner node process that way.
We could in theory continue to support this in cgroupv1, but in the
interest in keeping behaviour similar in both hierarchies, let's drop
this altogether.
Philosophically maybe it wasn't the greatest idea anyway to just go
berserk and SIGKILL all those processes — loud warning logging might
have sufficed, too.
This introduces a new function unit_prepare_exec() that encapsulates a
number of calls we do in preparation for spawning off some processes in
all our unit types that do so.
This allows us to neatly unify a bit of code between unit types and
shorten our code.
Let's handle gracefully if a client disconnects very early on.
This builds on #4120, but relaxes the condition checks further, since we
getpeername() might already fail during ExecStartPre= and friends.
Fixes: #7172
Both permit configuring data to pass through STDIN to an invoked
process. StandardInputText= accepts a line of text (possibly with
embedded C-style escapes as well as unit specifiers), which is appended
to the buffer to pass as stdin, followed by a single newline.
StandardInputData= is similar, but accepts arbitrary base64 encoded
data, and will not resolve specifiers or C-style escapes, nor append
newlines.
This may be used to pass input/configuration data to services, directly
in-line from unit files, either in a cooked or in a more raw format.
When preparing for a restart we quickly go through the DEAD/INACTIVE
service state before entering AUTO_RESTART. When doing this, we need to
make sure we don't destroy the FD store. Previously this was done by
checking the failure state of the unit, and keeping the FD store around
when the unit failed, under the assumption that the restart logic will
then get into action.
This is not entirely correct howver, as there might be failure states
that will no result in restarts.
With this commit we slightly alter the logic: a ref counter for the fd
store is added, that is increased right before we handle the restart
logic, and decreased again right-after.
This should ensure that the fdstore lives exactly as long as it needs.
Follow-up for f0bfbfac43.
And let's make use of it to implement two new unit settings with it:
1. LogLevelMax= is a new per-unit setting that may be used to configure
log priority filtering: set it to LogLevelMax=notice and only
messages of level "notice" and lower (i.e. more important) will be
processed, all others are dropped.
2. LogExtraFields= is a new per-unit setting for configuring per-unit
journal fields, that are implicitly included in every log record
generated by the unit's processes. It takes field/value pairs in the
form of FOO=BAR.
Also, related to this, one exisiting unit setting is ported to this new
facility:
3. The invocation ID is now pulled from /run/systemd/units/ instead of
cgroupfs xattrs. This substantially relaxes requirements of systemd
on the kernel version and the privileges it runs with (specifically,
cgroupfs xattrs are not available in containers, since they are
stored in kernel memory, and hence are unsafe to permit to lesser
privileged code).
/run/systemd/units/ is a new directory, which contains a number of files
and symlinks encoding the above information. PID 1 creates and manages
these files, and journald reads them from there.
Note that this is supposed to be a direct path between PID 1 and the
journal only, due to the special runtime environment the journal runs
in. Normally, today we shouldn't introduce new interfaces that (mis-)use
a file system as IPC framework, and instead just an IPC system, but this
is very hard to do between the journal and PID 1, as long as the IPC
system is a subject PID 1 manages, and itself a client to the journal.
This patch cleans up a couple of types used in journal code:
specifically we switch to size_t for a couple of memory-sizing values,
as size_t is the right choice for everything that is memory.
Fixes: #4089Fixes: #3041Fixes: #4441
This replaces the dependencies Set* objects by Hashmap* objects, where
the key is the depending Unit, and the value is a bitmask encoding why
the specific dependency was created.
The bitmask contains a number of different, defined bits, that indicate
why dependencies exist, for example whether they are created due to
explicitly configured deps in files, by udev rules or implicitly.
Note that memory usage is not increased by this change, even though we
store more information, as we manage to encode the bit mask inside the
value pointer each Hashmap entry contains.
Why this all? When we know how a dependency came to be, we can update
dependencies correctly when a configuration source changes but others
are left unaltered. Specifically:
1. We can fix UDEV_WANTS dependency generation: so far we kept adding
dependencies configured that way, but if a device lost such a
dependency we couldn't them again as there was no scheme for removing
of dependencies in place.
2. We can implement "pin-pointed" reload of unit files. If we know what
dependencies were created as result of configuration in a unit file,
then we know what to flush out when we want to reload it.
3. It's useful for debugging: "systemd-analyze dump" now shows
this information, helping substantially with understanding how
systemd's dependency tree came to be the way it came to be.
Failure to spawn ExecStartPost was being handled differently to e.g.
EXIT_FAILURE returned by ExecStartPost. It looks like this was an
oversight. Fix to match documented behaviour.
`man systemd.service`:
> Note that if any of the commands specified in ExecStartPre=, ExecStart=,
> or ExecStartPost= fail (and are not prefixed with "-", see above) or time
> out before the service is fully up, execution continues with commands
> specified in ExecStopPost=, the commands in ExecStop= are skipped.
No need to wait for a timeout when we know things are not going to work out.
When the main process goes away and only notifications from the main process are
accepted, then we will not receive any notifications anymore.
Currently, all three of cgroup_good(), main_pid_good(),
control_pid_good() all return an "int" (two of them propagate errors).
It's a good thing to keep the three functions similar, so let's leave it
at that, but then let's clean up the invocation of the three functions
so that they always clearly acknowledge that the return value is not a
bool, but potentially negative.
The compiler should be good enough to figure this out on its own if this
is a static function, and it makes control_pid_good() an outlier anyway,
and decorators like this tend to bitrot. Hence, to keep things simple
and automatic, let's just drop the decorator.
The processes associated with a service are not just the ones in its
cgroup, but also the control and main processes, which might possibly
live outside of it, for example if they transitioned into their own
cgroups because they registered a PAM session of their own. Hence, if we
get a cgroup empty notification always check if the main PID is still
around before taking action too eagerly.
Fixes: #6045
This slightly changes how we log about failures. Previously,
service_enter_dead() would log that a service unit failed along with its
result code, and unit_notify() would do this again but without the
result code. For other unit types only the latter would take effect.
This cleans this up: we keep the message in unit_notify() only for debug
purposes, and add type-specific log lines to all our unit types that can
fail, and always place them before unit_notify() is invoked.
Or in other words: the duplicate log message for service units is
removed, and all other unit types get a more useful line with the
precise result code.
This makes sure that if we learn via inotify or another event source
that a cgroup is empty, and we checked that this is indeed the case (as
we might get spurious notifications through inotify, as the inotify
logic through the "cgroups.event" is pretty unspecific and might be
trigger for a variety of reasons), then we'll enqueue a defer event for
it, at a priority lower than SIGCHLD handling, so that we know for sure
that if there's waitid() data for a process we used it before
considering the cgroup empty notification.
Fixes: #6608
For some reason we didn't dump the cgroup context for a number of unit
types, including service units. Not sure how this wasn't noticed
before... Add this in.
This commit fixes crash described in
https://github.com/systemd/systemd/issues/6533
Multiple ExecStart lines are allowed only for oneshot services
anyway so it doesn't make sense to call service_run_next_main() with
services of type other than SERVICE_ONESHOT.
Referring back to reproducer from the issue, previously we didn't observe
this problem because s->main_command was reset after daemon-reload hence
we never reached the assert statement in service_run_next_main().
Fixes#6533
"Permissions" was a bit of a misnomer, as it suggests that UNIX file
permission bits are adjusted, which aren't really changed here. Instead,
this is about UNIX credentials such as users or groups, as well as
namespacing, hence let's use a more generic term here, without any
misleading reference to UNIX file permissions: "sandboxing", which shall
refer to all kinds of sandboxing technologies, including UID/GID
dropping, selinux relabelling, namespacing, seccomp, and so on.
The new unit_set_exec_params() call is to units what
manager_set_exec_params() is to the manager object: it initializes the
various fields from the relevant generic properties set.
This adds a per-service restart counter. Each time an automatic
restart is scheduled (due to Restart=) it is increased by one. Its
current value is exposed over the bus as NRestarts=. It is also logged
(in a structured, recognizable way) on each restart.
Note that this really only counts automatic starts triggered by Restart=
(which it nicely complements). Manual restarts will reset the counter,
as will explicit calls to "systemctl reset-failed". It's supposed to be
a tool for measure the automatic restart feature, and nothing else.
Fixes: #4126