This splits the "environment" field of Manager into two:
transient_environment and client_environment. The former is generated
from configuration file, kernel cmdline, environment generators. The
latter is the one the user can control with "systemctl set-environment"
and similar.
Both sets are merged transparently whenever needed. Separating the two
sets has the benefit that we can safely flush out the former while
keeping the latter during daemon reload cycles, so that env var settings
from env generators or configuration files do not accumulate, but
dynamic API changes are kept around.
Note that this change is not entirely transparent to users: if the user
first uses "set-environment" to override a transient variable, and then
uses "unset-environment" to unset it again things will revert to the
original transient variable now, while previously the variable was fully
removed. This change in behaviour should not matter too much though I
figure.
Fixes: #9972
Let's be more careful with what we serialize: let's ensure we never
serialize strings that are longer than LONG_LINE_MAX, so that we know we
can read them back with read_line(…, LONG_LINE_MAX, …) safely.
In order to implement this all serialization functions are move to
serialize.[ch], and internally will do line size checks. We'd rather
skip a serialization line (with a loud warning) than write an overly
long line out. Of course, this is just a second level protection, after
all the data we serialize shouldn't be this long in the first place.
While we are at it also clean up logging: while serializing make sure to
always log about errors immediately. Also, (void)ify all calls we don't
expect errors in (or catch errors as part of the general
fflush_and_check() at the end.
Users are often surprised that "systemd-run" command lines like
"systemd-run -p User=idontexist /bin/true" will return successfully,
even though the logs show that the process couldn't be invoked, as the
user "idontexist" doesn't exist. This is because Type=simple will only
wait until fork() succeeded before returning start-up success.
This patch adds a new service type Type=exec, which is very similar to
Type=simple, but waits until the child process completed the execve()
before returning success. It uses a pipe that has O_CLOEXEC set for this
logic, so that the kernel automatically sends POLLHUP on it when the
execve() succeeded but leaves the pipe open if not. This means PID 1
waits exactly until the execve() succeeded in the child, and not longer
and not shorter, which is the desired functionality.
Making use of this new functionality, the command line
"systemd-run -p User=idontexist -p Type=exec /bin/true" will now fail,
as expected.
Whenever a unit is started fresh we should flush out any runtime data
from the previous cycle. We are pretty good at that already, but what so
far we missed was the ExecStart=/ExecStop=/… command exit status data.
Let's fix that, and properly flush out that stuff too.
Consider this service:
[Service]
ExecStart=/bin/sleep infinity
ExecStop=/bin/false
When this service is started, then stopped and then started again
"systemctl status" would show the ExecStop= results of the previous run
along with the ExecStart= results of the current one, which is very
confusing. With this patch this is corrected: the data is kept right
until the moment the new service cycle starts, and then flushed out.
Hence "systemctl status" in that case will only show the ExecStart=
data, but no ExecStop= data, like it should be.
This should fix part of the confusion of #9588
These lines are generally out-of-date, incomplete and unnecessary. With
SPDX and git repository much more accurate and fine grained information
about licensing and authorship is available, hence let's drop the
per-file copyright notice. Of course, removing copyright lines of others
is problematic, hence this commit only removes my own lines and leaves
all others untouched. It might be nicer if sooner or later those could
go away too, making git the only and accurate source of authorship
information.
This part of the copyright blurb stems from the GPL use recommendations:
https://www.gnu.org/licenses/gpl-howto.en.html
The concept appears to originate in times where version control was per
file, instead of per tree, and was a way to glue the files together.
Ultimately, we nowadays don't live in that world anymore, and this
information is entirely useless anyway, as people are very welcome to
copy these files into any projects they like, and they shouldn't have to
change bits that are part of our copyright header for that.
hence, let's just get rid of this old cruft, and shorten our codebase a
bit.
Previously the enumerate() callback defined for each unit type would do
two things:
1. It would create perpetual units (i.e. -.slice, system.slice, -.mount and
init.scope)
2. It would enumerate units from /proc/self/mountinfo, /proc/swaps and
the udev database
With this change these two parts are split into two seperate methods:
enumerate() now only does #2, while enumerate_perpetual() is responsible
for #1. Why make this change? Well, perpetual units should have a
slightly different effect that those found through enumeration: as
perpetual units should be up unconditionally, perpetually and thus never
change state, they should also not pull in deps by their state changing,
not even when the state is first set to active. Thus, their state is
generally initialized through the per-device coldplug() method in
similar fashion to the deserialized state from a previous run would be
put into place. OTOH units found through regular enumeration should
result in state changes (and thus pull in deps due to state changes),
hence their state should be put in effect in the catchup() method
instead. Hence, given this difference, let's also separate the
functions, so that the rule is:
1. What is created in enumerate_perpetual() should be started in
coldplug()
2. What is created in enumerate() should be started in catchup().
let's drop the "now" argument, it's exactly what MANAGER_IS_RUNNING()
returns, hence let's use that instead to simplify things.
Moreover, let's change the add/found argument pair to become found/mask,
which allows us to change multiple flags at the same time into opposing
directions, which will be useful later on.
Also, let's change the return type to void. It's a notifier call where
callers will ignore the return value anyway as it is nothing actionable.
Should not change behaviour.
The function is similar to path_kill_slashes() but also removes
initial './', trailing '/.', and '/./' in the path.
When the second argument of path_simplify() is false, then it
behaves as the same as path_kill_slashes(). Hence, this also
replaces path_kill_slashes() with path_simplify().
This adds a flags parameter to unit_notify() which can be used to pass
additional notification information to the function. We the make the old
reload_failure boolean parameter one of these flags, and then add a new
flag that let's unit_notify() if we are configured to restart the
service.
Note that this adjusts behaviour of systemd to match what the docs say.
Fixes: #8398
Files which are installed as-is (any .service and other unit files, .conf
files, .policy files, etc), are left as is. My assumption is that SPDX
identifiers are not yet that well known, so it's better to retain the
extended header to avoid any doubt.
I also kept any copyright lines. We can probably remove them, but it'd nice to
obtain explicit acks from all involved authors before doing that.
linux/fs.h sys/mount.h, libmount.h and missing.h all include MS_*
definitions.
To avoid problems, only one of linux/fs.h, sys/mount.h and libmount.h
should be included. And missing.h must be included last.
Without this, building systemd may fail with:
In file included from [...]/libmount/libmount.h:31:0,
from ../systemd-238/src/core/manager.h:23,
from ../systemd-238/src/core/emergency-action.h:37,
from ../systemd-238/src/core/unit.h:34,
from ../systemd-238/src/core/dbus-timer.h:25,
from ../systemd-238/src/core/timer.c:26:
[...]/sys/mount.h:57:2: error: expected identifier before numeric constant
This is analogous to 8d3ae2bd4c, except that now
src/core/umount.c not src/core/mount.c is converted.
Might help with https://bugzilla.redhat.com/show_bug.cgi?id=1554943, or not.
In the patch, mnt_free_tablep and mnt_free_iterp are declared twice. It'd
be nicer to define them just once in mount-setup.h, but then libmount.h would
have to be included there. libmount.h seems to be buggy, and declares some
defines which break other headers, and working around this is more pain than
the two duplicate lines. So let's live with the duplication for now.
This fixes memleak of MountPoint in mount_points_list_get() on error, not that
it matters any.
"check" is unclear: what is true, what is false? Let's rename to "can_gc" and
revert the return value ("positive" values are easier to grok).
v2:
- rename from unit_can_gc to unit_may_gc
Let's simplify things a bit: we so far called both functions every
single time, let's just merge one into the other, so that we have fewer
functions to call.
Before this, each ExecRuntime object is owned by a unit. However,
it may be shared with other units which enable JoinsNamespaceOf=.
Thus, by the serialization/deserialization process, its sharing
information, more specifically, reference counter is lost, and
causes issue #7790.
This makes ExecRuntime objects be managed by manager, and changes
the serialization/deserialization process.
Fixes#7790.
We do that in all other cases, let's do it here too. Since
SD_EVENT_PRIORITY_NORMAL evaluates to zero there's zero effective
difference, but it makes things easier to grok and grep for if we always
express relative priorities within PID 1 only.
So far, we considered mount units activated as soon as the mount
appeared. This avoided seeing a difference between mounts started by
systemd, and e.g. by running `mount` from a terminal.
(`umount` was not handled this way).
However in some cases, options passed to `mount` require additional
system calls after the mount is successfully created. E.g. the
`private` mount option, or the `ro` option on bind mounts.
It seems best to wait for mount to finish doing that. E.g. in
the `private` case, the current behaviour could theoretically cause
non-deterministic results, as child mounts inherit the
private/shared propagation setting from their parent.
This also avoids a special case in mount_reload().
It _looks_ as if, back when we used to retry unsuccessful calls to umount,
this would have inflated the effective timeout. Multiplying it by
RETRY_UMOUNT_MAX. Which is set to 32.
I'm surprised if it's true: I would have expected it to be noticed during
the work on NFS timeouts. But I can't see what would have stopped it.
Clarify that I do not expect this to happen anymore. I think each
individual umount call is allowed up to the full timeout, but if umount
ever exited with a signal status, we would stop retrying.
To be extra clear, make sure that we do not retry in the event that umount
perversely returned EXIT_SUCCESS after receiving SIGTERM.
"Due to the io event priority logic we can be sure the new mountinfo is
loaded before we process the SIGCHLD for the mount command."
I think this is a reasonable expectation. But if it works, then the
other comment must be false:
"Note that mount(8) returning and the kernel sending us a mount table
change event might happen out-of-order."
Therefore we can clean up the code for the latter.
If this is working as advertised, then we can make sure that mount units
fail if the mount we thought we were creating did not actually appear,
due to races or trickery (or because /sbin/mount did something unexpected
despite returning EXIT_SUCCESS).
Include a specific warning message for this failure.
If we give up when the mount point is still mounted after 32 successful
calls to /sbin/umount, that seems a fairly similar case. So make that
message a LOG_WARN as well (not LOG_DEBUG). Also, this was recently changed to only
retry while umount is returning EXIT_SUCCESS; in that case in particular
there would be no other messages in the log to suggest what had happened.
It was forbidden to create mount units for a symlink. But the reason is
that the mount unit needs to know the real path that will appear in
/proc/self/mountinfo. The kernel dereferences *all* the symlinks in the
path at mount time (I checked this with `mount -c` running under `strace`).
This will have no effect on most systems. As recommended by docs, most
systems use /etc/fstab, as opposed to native mount unit files.
fstab-generator dereferences symlinks for backwards compatibility.
A relatively minor issue regarding Time Of Check / Time Of Use also exists
here. I can't see how to get rid of it entirely. If we pass an absolute
path to mount, the racing process can replace it with a symlink. If we
chdir() to the mount point and pass ".", the racing process can move the
directory. The latter might potentially be nicer, except that it breaks
WorkingDirectory=.
I'm not saying the race is relevant to security - I just want to consider
how bad the effect is. Currently, it can make the mount unit active (and
hence the job return success), despite there never being a matching entry
in /proc/self/mountinfo. This wart will be removed in the next commit;
i.e. it will make the mount unit fail instead.
Testing the previous commit with `systemctl stop tmp.mount` logged the
reason for failure as expected, but unexpectedly the message was repeated
32 times.
The retry is a special case for umount; it is only supposed to cover the
case where the umount command was _successful_, but there was still some
remaining mount(s) underneath. Fix it by making sure to test the first
condition :).
Re-tested with and without a preceding `mount --bind /mnt /tmp`,
and using `findmnt` to check the end result.
Documentation - systemd.exec - strongly implies mount units get logging.
It is safe for mounts to depend on systemd-journald.socket. There is no
cyclic dependency generated. This is because the root, -.mount, was
already deliberately set to EXEC_OUTPUT_NULL. See comment in
mount_load_root_mount(). And /run is excluded from being a mount unit.
Nor does systemd-journald depend on /var. It starts earlier, initially
logging to /run.
Tested before/after using `systemctl stop tmp.mount`.
Currently, if there are two /proc/self/mountinfo entries with the same
mount point path, the mount setup flags computed for the second of
these two entries will overwrite the mount setup flags computed for
the first of these two entries. This is the root cause of issue #7798.
This patch changes mount_setup_existing_unit to prevent the
just_mounted mount setup flag from being overwritten if it is set to
true. This will allow all mount units created from /proc/self/mountinfo
entries to be initialized properly.
Fixes: #7798
Now that we don't kill control processes anymore, let's at least warn
about any processes left-over in the unit cgroup at the moment of
starting the unit.
This introduces a new function unit_prepare_exec() that encapsulates a
number of calls we do in preparation for spawning off some processes in
all our unit types that do so.
This allows us to neatly unify a bit of code between unit types and
shorten our code.
And let's make use of it to implement two new unit settings with it:
1. LogLevelMax= is a new per-unit setting that may be used to configure
log priority filtering: set it to LogLevelMax=notice and only
messages of level "notice" and lower (i.e. more important) will be
processed, all others are dropped.
2. LogExtraFields= is a new per-unit setting for configuring per-unit
journal fields, that are implicitly included in every log record
generated by the unit's processes. It takes field/value pairs in the
form of FOO=BAR.
Also, related to this, one exisiting unit setting is ported to this new
facility:
3. The invocation ID is now pulled from /run/systemd/units/ instead of
cgroupfs xattrs. This substantially relaxes requirements of systemd
on the kernel version and the privileges it runs with (specifically,
cgroupfs xattrs are not available in containers, since they are
stored in kernel memory, and hence are unsafe to permit to lesser
privileged code).
/run/systemd/units/ is a new directory, which contains a number of files
and symlinks encoding the above information. PID 1 creates and manages
these files, and journald reads them from there.
Note that this is supposed to be a direct path between PID 1 and the
journal only, due to the special runtime environment the journal runs
in. Normally, today we shouldn't introduce new interfaces that (mis-)use
a file system as IPC framework, and instead just an IPC system, but this
is very hard to do between the journal and PID 1, as long as the IPC
system is a subject PID 1 manages, and itself a client to the journal.
This patch cleans up a couple of types used in journal code:
specifically we switch to size_t for a couple of memory-sizing values,
as size_t is the right choice for everything that is memory.
Fixes: #4089Fixes: #3041Fixes: #4441
There should be a way to turn this logic of, and DefaultDependencies=
appears to be the right option for that, hence let's downgrade this
dependency type from "implicit" to "default, and thus honour
DefaultDependencies=.
This also drops mount_get_fstype() as we only have a single user needing
this now.
A follow-up for #7076.
This replaces the dependencies Set* objects by Hashmap* objects, where
the key is the depending Unit, and the value is a bitmask encoding why
the specific dependency was created.
The bitmask contains a number of different, defined bits, that indicate
why dependencies exist, for example whether they are created due to
explicitly configured deps in files, by udev rules or implicitly.
Note that memory usage is not increased by this change, even though we
store more information, as we manage to encode the bit mask inside the
value pointer each Hashmap entry contains.
Why this all? When we know how a dependency came to be, we can update
dependencies correctly when a configuration source changes but others
are left unaltered. Specifically:
1. We can fix UDEV_WANTS dependency generation: so far we kept adding
dependencies configured that way, but if a device lost such a
dependency we couldn't them again as there was no scheme for removing
of dependencies in place.
2. We can implement "pin-pointed" reload of unit files. If we know what
dependencies were created as result of configuration in a unit file,
then we know what to flush out when we want to reload it.
3. It's useful for debugging: "systemd-analyze dump" now shows
this information, helping substantially with understanding how
systemd's dependency tree came to be the way it came to be.
Update the timeout warnings for remount and unmount. For consistency with
mount, for accuracy, and for consistency with their equivalents in
service.c.