Whether seccomp is supported or not is a server implementation detail,
the client should not be altered by that, and clients should be able to talk
to servers configured differently than the client, hence drop the
HAVE_SECCOMP ifdeffery here.
(This would be different if we'd need libseccomp or so to implement the
client, but we don't)
Both permit configuring data to pass through STDIN to an invoked
process. StandardInputText= accepts a line of text (possibly with
embedded C-style escapes as well as unit specifiers), which is appended
to the buffer to pass as stdin, followed by a single newline.
StandardInputData= is similar, but accepts arbitrary base64 encoded
data, and will not resolve specifiers or C-style escapes, nor append
newlines.
This may be used to pass input/configuration data to services, directly
in-line from unit files, either in a cooked or in a more raw format.
If the string length is specified as (size_t) -1, let's use that as
indicator for determining the length on our own. This makes it
slightlier shorter to invoke these APIs for a very common case.
Also, do some minor other coding style updates, and add assert()s here
and there.
All this function does is place some data in an in-memory read-only fd,
that may be read back to get the original data back.
Doing this in a way that works everywhere, given the different kernels
we support as well as different privilege levels is surprisingly
complex.
Mostly coding style fixes, but most importantly, initialize
c->std_input only after we know the free_and_strdup() invocation
succeeded, so that we don't leave half-initialized fields around on
failure.
Given that Linux assigns the same ioctl numbers ot multiple subsystems,
we should be careful when invoking ioctls, so that we don't end up
calling something we wouldn't want to call.
We are using the same pattern at various places: call dup2() on an fd,
and close the old fd, usually in combination with some O_CLOEXEC
fiddling. Let's add a little helper for this, and port a few obvious
cases over.
We currently use the ownership of the top-level directory as a hint
whether we need to descent into the whole tree to chown() it recursively
or not. This is problematic with the previous chown()ing algorithm, as
when descending into the tree we'd first chown() and then descend
further down, which meant that the top-level directory would be chowned
first, and an aborted recursive chowning would appear on the next
invocation as successful, even though it was not. Let's reshuffle things
a bit, to make the re-chown()ing safe regarding interruptions:
a) We chown() the dir we are looking at last, and descent into all its
children first. That way we know that if the top-level dir is
properly owned everything inside of it is properly owned too.
b) Before starting a chown()ing operation, we mark the top-level
directory as owned by a special "busy" UID range, which we can use to
recognize whether a tree was fully chowned: if it is marked as busy,
it's definitely not fully chowned, as the busy ownership will only be
fixed as final step of the chowning.
Fixes: #6292
Linux doesn't have faccess(), hence let's emulate it. Linux has access()
and faccessat() but neither allows checking the access rights of an fd
passed in directly.
Before this, assigning empty string to Delegate= makes no change to the
controller list. This is inconsistent to the other options that take list
of strings. After this, when empty string is assigned to Delegate=, the
list of controllers is reset. Such behavior is consistent to other options
and useful for drop-in configs.
Closes#7334.
This option is likely to be very useful for systemd-run invocations,
hence let's add a shortcut for it.
With this new concepts it's now very easy to put together systemd-run
invocations that leave zero artifacts in the system, including when they
fail.
Right now, the option only takes one of two possible values "inactive"
or "inactive-or-failed", the former being the default, and exposing same
behaviour as the status quo ante. If set to "inactive-or-failed" units
may be collected by the GC logic when in the "failed" state too.
This logic should be a nicer alternative to using the "-" modifier for
ExecStart= and friends, as the exit data is collected and logged about
and only removed when the GC comes along. This should be useful in
particular for per-connection socket-activated services, as well as
"systemd-run" command lines that shall leave no artifacts in the
system.
I was thinking about whether to expose this as a boolean, but opted for
an enum instead, as I have the suspicion other tweaks like this might be
a added later on, in which case we extend this setting instead of having
to add yet another one.
Also, let's add some documentation for the GC logic.
When preparing for a restart we quickly go through the DEAD/INACTIVE
service state before entering AUTO_RESTART. When doing this, we need to
make sure we don't destroy the FD store. Previously this was done by
checking the failure state of the unit, and keeping the FD store around
when the unit failed, under the assumption that the restart logic will
then get into action.
This is not entirely correct howver, as there might be failure states
that will no result in restarts.
With this commit we slightly alter the logic: a ref counter for the fd
store is added, that is increased right before we handle the restart
logic, and decreased again right-after.
This should ensure that the fdstore lives exactly as long as it needs.
Follow-up for f0bfbfac43.
And let's make use of it to implement two new unit settings with it:
1. LogLevelMax= is a new per-unit setting that may be used to configure
log priority filtering: set it to LogLevelMax=notice and only
messages of level "notice" and lower (i.e. more important) will be
processed, all others are dropped.
2. LogExtraFields= is a new per-unit setting for configuring per-unit
journal fields, that are implicitly included in every log record
generated by the unit's processes. It takes field/value pairs in the
form of FOO=BAR.
Also, related to this, one exisiting unit setting is ported to this new
facility:
3. The invocation ID is now pulled from /run/systemd/units/ instead of
cgroupfs xattrs. This substantially relaxes requirements of systemd
on the kernel version and the privileges it runs with (specifically,
cgroupfs xattrs are not available in containers, since they are
stored in kernel memory, and hence are unsafe to permit to lesser
privileged code).
/run/systemd/units/ is a new directory, which contains a number of files
and symlinks encoding the above information. PID 1 creates and manages
these files, and journald reads them from there.
Note that this is supposed to be a direct path between PID 1 and the
journal only, due to the special runtime environment the journal runs
in. Normally, today we shouldn't introduce new interfaces that (mis-)use
a file system as IPC framework, and instead just an IPC system, but this
is very hard to do between the journal and PID 1, as long as the IPC
system is a subject PID 1 manages, and itself a client to the journal.
This patch cleans up a couple of types used in journal code:
specifically we switch to size_t for a couple of memory-sizing values,
as size_t is the right choice for everything that is memory.
Fixes: #4089Fixes: #3041Fixes: #4441
When we drop messages of a unit, we log about. Let's add some structured
data to that. Let's include how many messages we dropped, but more
importantly, let's link up the message we generate to the unit we
dropped the messages from by using the "OBJECT" logic, i.e. by
generating OBJECT_SYSTEMD_UNIT= fields and suchlike, that "journalctl
-u" and friends already look for.
Fixes: #6494
In chroot environments, /etc might not be fully initialized: /etc/machine-id
can be missing for example. This makes the expansions of affected specifiers
impossible at that time.
These cases should not be considered as errors and such failures shouldn't be
logged at an error level therefore this patch downgrades the level used to
LOG_NOTICE in such cases.
Also this is logged at LOG_NOTICE only the first time and then downgrade to
LOG_DEBUG for the rest. That way, if debugging is enabled we get the full
output, but otherwise we only see only one message.
The expansion of specifiers is now self contained in a dedicated function
instead of being spread all over the place.
Implement DHCPv6 option to exchange information about the Fully
Qualified Domain Name (FQDN) according to RFC 4704.
The RFC 4704 describes two models of operations in section 3,
currently only the second model is supported (DHCPv6 server
updates both the AAAA and the PTR RRs).
The existing DHCP Section Options SendHostname and Hostname are
sent as FQDN to the server. According to section 4.2 sending
only parts of its FQDN is allowed.
Fixes#4682.
Technically DNS allows any ASCII character to be used in the
domain name. Also the DHCP specification for the FQDN option
(RFC 4702) does not put restriction on labels.
However, hostnames do have stricter requirements and typically
should only use characters from a-z (case insensitve), 0-9 and
minus.
Currently we require hostname/FQDN to be either a hostname or
a valid DNS name. Since dns_name_is_valid() allows any ASCII
characters this allows to specify hostnames which are typically
not valid.
Check hostname/FQDN more strictly and require them to pass both
tests. Specifically this requires the entire FQDN to be below 63.
We would continue, but still return an error at the end. This isn't useful
because we'd still error-out in main().
Also, add a missing error message when we fail to mkdir.
It is possible that kernel will reject our netlink message that
configures the link. However, we should always make sure that interface
will be named properly otherwise we can leave interfaces having
unpredictable kernel names. Thus we don't return early and continue to
export name and link file properties.
Suggested-by: Tom Gundersen <teg@jklm.no>
SetLinger is authorized by the PolicyKit action "set-self-linger", if it is
not passed an explicit UID.
According to comments we were determining the default UID from the client's
session. However, user processes e.g. which are run from a terminal
emulator do not necessarily belong to a session scope unit. They may
equally be started from the systemd user manager [1][2]. Actually the
comment was wrong, and it would also have worked for processes
started from the systemd user manager.
Nevertheless it seems to involve fetching "augmented credentials" i.e.
it's using a racy method, so we shouldn't have been authenticating based
on it.
We could change the default UID, but that raises issues especially for
consistency between the methods. Instead we can just use the clients
effective UID for authorization.
This commit also fixes `loginctl enable-linger $USER` to match the docs
that say it was equivalent to `loginctl enable-linger` (given that $USER
matches the callers user and owner_uid). Previously, the former would not
have suceeded for unpriviliged users in the default configuration.
[1] It seems the main meaning of per-session scopes is tracking the PAM
login process. Killing that provokes logind to revoke device access. Less
circularly, killing it provokes getty to hangup the TTY.
[2] User units may be started with an environment which includes
XDG_SESSION_ID (presuambly GNOME does this?). Or not.
To maintain consistency with `loginctl user-status`, drop the fallback to
XDG_SESSION_ID for `loginctl enable-linger`. The fallback was unnecessary
and also incorrect: it passed the numeric value of the session identifier
as a UID value.
The manpages tell that such calls have quite limited meaning. logind has
a few in the implementation of what remains of the session concept.
At the same time, logind basically exposes sd_pid_get_session() as public
API. This is absolutely required, to retain compatability e.g. with Xorg.
But client code will work in more situations if it avoids assuming that it
runs in a session itself.
Its use inside the login session could be replaced with $XDG_SESSION_ID
(which pam_systemd sets). I don't know whether it would be useful to
change Xorg at this point or not. But if you were building something new,
you would think about whether you want to support running it in a systemd
service.
Comment these logind API features, acknowledging the reason they exist is
based in history. I.e. help readers avoid drawing implications from their
existence which apply to history, but not the current general case.
Finally, searching these revealed a call to sd_pid_get_session() in
implementing some types of logind inhibitors. So these inhibitors don't
work as intended when taken from inside a systemd user service :(. Comment
this as well, deferring it as ticket #6852.
If you try to run `loginctl user-status` on a non-logged in user to see
whether "Linger" is enabled, it doesn't work.
If you're already an expert in logind, the fact that the user is considered
unknown actually tells you the user is not lingering. So, probably they
they do not have lingering enabled. I think we can point towards this
without being misleading.
I also reword it because I thought it was slightly confusing to run
`loginctl user-status root` and get an error back about "User 0". Try to
be more specific, that it is "User ID 0".
It's confusing that the bus API has aliases like "session/self" that return
an error based on ENXIO, when it also has methods that return e.g.
NO_SESSION_FOR_PID for the same problem. The latter kind of error includes
more specifically helpful messages.
"user/self" is the odd one out; it returns a generic UnknownObject error
when it is not applicable to the caller. It's not clear whether this was
intentional, but at first I thought it was more correct. More
specifically, user_object_find() was returning 0 for "user/self", in the
same situations (more or less) where user_node_enumerator() was omitting
"user/self". I thought that was a good idea, because returning e.g. -ENXIO instead
suggested that there _is_ something specific on that path. And it could be
confused with errors of the method being called.
Therefore I suggested changing the enumerator, always admitting that there
is a handler for the path "foo/self", but returning a specific error when
queried. However this interacts poorly with tools like D-Feet or `busctl`.
In either tool, looking at logind would show an error message, and then go
on to omit "user/self" in the normal listing. These tools are very useful,
so we don't want to interfere with them.
I think we can change the error codes without causing problems. The self
objects were not listed in the documentation. They have been suggested to
other projects - but without reference to error reporting. "seat/self" is
used by various Wayland compositors for VT switching, but they don't appear
to reference specific errors.
We _could_ insist on the link between enumeration and UnknownObject, and
standardize on that as the error for the aliases. But I'm not aware of any
practical complaints, that we returned an error from an object that didn't
exist.
Instead, let's unify the codepaths for "user/self" vs GetUserByPid(0) etc.
We will return the most helpful error message we can think of, if the
object does not exist. E.g. for "session/self", we might return an error
that the caller does not belong to a session. If one of the compositors is
ever simplified to use "session/self" in initialization, users would be
able to trigger such errors (e.g. run `gnome-shell` inside gnome-terminal).
The message text will most likely be logged. The user might not know what
the "session" is, but at least we'll be pointing towards the right
questions. I think it should also be clearer for development / debugging.
Unifying the code paths is also slightly helpful for auditing / marking
calls to sd_bus_creds_get_session() in subsequent commits.
GetSessionByPID(0) can fail with NO_SESSION_FOR_PID. More obscurely, if
the session is abandoned, it can return NO_SUCH_SESSION. It is not clear
that the latter was intended. The message associated with the former,
hints that this was overlooked.
We don't have a document enumerating the errors. Any specific
error-handling in client code, e.g. translated messages, would also be
liable to overlook the more obscure error code.
I can't see any equivalent condition for GetUserByPID(0). On the other
hand, the code did not return NO_USER_FOR_PID where it probably should.
The relevant code is right next to that for GetSessionByPID(0), so it will
be simpler to understand if both follow the same pattern.
PR #7186 changes requires_mounts_for from strv to Hashmap.
So, it is necessary to implement a function for getting the property RequiresMountsFor=.
This introduces property_get_requires_mounts_for() which reads the Hashmap
and sends messages to bus.
Fixes#7321.
If you use machinectl to bind a file into a container, it responds with a confusing error message about a temporary directory not being a directory.
I just swapped it to error with the source that was passed, rather than the tmpdir.
It would also be nice to be able to bind files, but that's a separate issue (#7195).
Before the change:
root@epona /var/lib/sandbox $ cat bar/foo
Hello world!
root@epona /var/lib/sandbox $ machinectl bind testing /var/lib/sandbox/bar/foo /foo
Failed to bind mount: Failed to overmount /tmp/propagate.W5TNsj/mount: Not a directory
After the change:
root@epona /var/lib/sandbox $ machinectl bind testing /var/lib/sandbox/bar/foo /foo
Failed to bind mount: Failed to overmount /var/lib/sandbox/bar/foo: Not a directory
Let's reduce the amount of noise a bit, there's little point in
complaining loudly about every single unit like this, let's complain
only about the first one, and then downgrade the log level to LOG_DEBUG
for the other cases.
Fixes: #7188
This didn't work during the initial conversion to meson, but should now.
A sufficiently new polkit is also required, for the .its rules files.
Note that https://github.com/mesonbuild/meson/blob/master/docs/markdown/i18n-module.md
says that 'install' argument was added in meson 0.43.0. If this is accurate,
warnigs might be generated with older mesons. Fedora has 0.43.0 across the
board, but other distros probably don't, but I guess that a warning is
prefereable to having to update do latest meson.
The advantages are:
- one less dependency (intltool)
- using the generic implementation instead of our open-coded calls
- we don't need to use the fake "_" prefixes in XML
Replaces #1609, fixes#7300.
We already print it as part of log_syntax() internal logic, don't print
it again, and in particular, don't print it at the end of log line, such
a strange place.
Follow-up for: 142468d895
So far, we assumed that kernels where TRIE was on also supported
BPF/cgroup stuff. That's not a correct assumption to make, hence check
for both features separately.
Fixes: #7054
Previously it was not possible to select which controllers to enable for
a unit where Delegate=yes was set, as all controllers were enabled. With
this change, this is made configurable, and thus delegation units can
pick specifically what they want to manage themselves, and what they
don't care about.
Otherwise it's a pointless excercise, as we'll set up an empty directory
tree that's never going to be used.
Hence, let's move this around a bit, so that we do the basesystem
initialization exactly when RootImage= or RootDirectory= are used, but
not otherwise.
When compiling with an old kernel on architectures for which the
number is not defined in missing.h, a warning is generated in missing.h.
Let's just skip the protection in this case, to allow build to proceed.
MemoryDenyWriteExecution policy could be be bypassed by using pkey_mprotect
instead of mprotect to create an executable writable mapping.
The impact is mitigated by the fact that the man page says "Note that this
feature is fully available on x86-64, and partially on x86", so hopefully
people do not rely on it as a sole security measure.
Found by Karin Hossen and Thomas Imbert from Sogeti ESEC R&D.
https://bugs.launchpad.net/bugs/1725348
Let's clarify that it's not networkd that renames interfaces, but
something else (for example, udev's link builtin based on .link files)
This doesn't change any logic, it just rewords the message a bit, to
clarify that we only log this for informational purposes, not because we
execute the rename operation ourselves.
Fixes: #7143
Fix loginctl seat sysfs tree ellipsation logic.
Simple reproducer:
loginctl --full seat-status seat0|cat
→ after this PR, all lines are shown in full. Before, lines were ellipsized to terminal width.
This makes each system call in SystemCallFilter= blacklist optionally
takes errno name or number after a colon. The errno takes precedence
over the one given by SystemCallErrorNumber=.
C.f. #7173.
Closes#7169.
Let's hook up the sysfs tree output with the output flags logic,
already used when dumping log lines or process trees. This way we get
very similar output handling for line breaking/ellipsation in all three
outputs of structured data.
Fixes: #7095
Let's say that (size_t) -1 (i.e. SIZE_T_MAX) is equivalent to
"unbounded" ellipsation, i.e. ellipsation as NOP. In which case the
relevant functions become little more than strdup()/strndup().
This is useful to simplify caller code in case we want to turn off
ellipsation in certain code paths with minimal caller-side handling for
this.
loginctl, machinectl, systemctl all have very similar implementations of
a get_output_flags() functions. Simplify it by merging two lines that
set the same flag.
There should be a way to turn this logic of, and DefaultDependencies=
appears to be the right option for that, hence let's downgrade this
dependency type from "implicit" to "default, and thus honour
DefaultDependencies=.
This also drops mount_get_fstype() as we only have a single user needing
this now.
A follow-up for #7076.
Previously dependencies configured with SYSTEMD_WANTS would be collected
on a device unit as long as it was loaded. let's fix that, and remove
dependencies again when SYTEMD_WANTS changes.
This replaces the dependencies Set* objects by Hashmap* objects, where
the key is the depending Unit, and the value is a bitmask encoding why
the specific dependency was created.
The bitmask contains a number of different, defined bits, that indicate
why dependencies exist, for example whether they are created due to
explicitly configured deps in files, by udev rules or implicitly.
Note that memory usage is not increased by this change, even though we
store more information, as we manage to encode the bit mask inside the
value pointer each Hashmap entry contains.
Why this all? When we know how a dependency came to be, we can update
dependencies correctly when a configuration source changes but others
are left unaltered. Specifically:
1. We can fix UDEV_WANTS dependency generation: so far we kept adding
dependencies configured that way, but if a device lost such a
dependency we couldn't them again as there was no scheme for removing
of dependencies in place.
2. We can implement "pin-pointed" reload of unit files. If we know what
dependencies were created as result of configuration in a unit file,
then we know what to flush out when we want to reload it.
3. It's useful for debugging: "systemd-analyze dump" now shows
this information, helping substantially with understanding how
systemd's dependency tree came to be the way it came to be.
We neglected to set error_message for errors which occur _after_ the
`finish` label. These fatal errors only happen in paths where `finish`
was reached successfully, i.e. error_message has not already been set
(and this analysis is simple enough that this need not cause too much
headaches. Also our new assignments to error_message come immediately
after execve() calls, which would have lost the error_message if it had
been set).
Also print a status message when we fail to exec init, otherwise the only
sign the user will see is `# ` :).
This addresses the lack of error messages pointed out in issue #6827.
It was dropped in 89439d4fc0. As a result, every
process that uses a hashmap allocates and then leaks the hashmap mempools.
The mempools are only allocated in the main thread, but we don't know where
the memory is used.
So let's check if we are the last thread and free the mempools then. This is
fairly heavy, because /proc/self/status has to be opened and parsed, but we do
it only when compiled for valgrind, i.e. not by default, and compared to running
under valgrind or asan, the extra cost is acceptable. The big advantage is that
we don't have to think or filter out this false positive.
As a micro-opt, cleanup is attempted only in the main thread. We could allow
any thread to check if it is the last one and perform cleanup, but that'd mean
that we'd have to _do_ the check in every thread. We don't use threads like
that, our non-main threads are always short-lived, so let's just accept the
possibility that we'll leak memory if a thread survives. The check is also
non-atomic, but it's called in a destructor of the main thread _and_ we do
cleanup only when there are no other threads, so the risk of some library
suddenly spawning another thread is very low. All in all, this is not perfect,
but should work in 999‰ of cases.
Fixes the following valgrind warning:
==22564== HEAP SUMMARY:
==22564== in use at exit: 8,192 bytes in 2 blocks
==22564== total heap usage: 243 allocs, 241 frees, 151,905 bytes allocated
==22564==
==22564== 4,096 bytes in 1 blocks are still reachable in loss record 1 of 2
==22564== at 0x4C2FB6B: malloc (vg_replace_malloc.c:299)
==22564== by 0x4F08A8C: mempool_alloc_tile (mempool.c:62)
==22564== by 0x4F08B16: mempool_alloc0_tile (mempool.c:81)
==22564== by 0x4EF8DE0: hashmap_base_new (hashmap.c:748)
==22564== by 0x4EF8ED9: internal_hashmap_new (hashmap.c:782)
==22564== by 0x11045D: test_hashmap_copy (test-hashmap-plain.c:87)
==22564== by 0x115722: test_hashmap_funcs (test-hashmap-plain.c:914)
==22564== by 0x10FC9D: main (test-hashmap.c:60)
==22564==
==22564== 4,096 bytes in 1 blocks are still reachable in loss record 2 of 2
==22564== at 0x4C2FB6B: malloc (vg_replace_malloc.c:299)
==22564== by 0x4F08A8C: mempool_alloc_tile (mempool.c:62)
==22564== by 0x4F08B16: mempool_alloc0_tile (mempool.c:81)
==22564== by 0x4EF8DE0: hashmap_base_new (hashmap.c:748)
==22564== by 0x4EF8EF8: internal_ordered_hashmap_new (hashmap.c:786)
==22564== by 0x10A2A0: test_ordered_hashmap_copy (test-hashmap-ordered.c:89)
==22564== by 0x10F70F: test_ordered_hashmap_funcs (test-hashmap-ordered.c:916)
==22564== by 0x10FCA2: main (test-hashmap.c:61)
==22564==
==22564== LEAK SUMMARY:
==22564== definitely lost: 0 bytes in 0 blocks
==22564== indirectly lost: 0 bytes in 0 blocks
==22564== possibly lost: 0 bytes in 0 blocks
==22564== still reachable: 8,192 bytes in 2 blocks
==22564== suppressed: 0 bytes in 0 blocks
v2:
- check if we are the main thread
v3:
- check if there are no other threads
PR #3865 introduced RemoveIPC= but the option is not listed in
load-fragment-gperf.gperf. So, the option could be used only via d-bus.
This adds RemoveIPC= in load-fragment-gperf.gperf. Then, now we can
set the option in unit files.
Fixes#7281.
This path to ping is compatible with both debian-like and usr-merged
distros. This keeps the test simple, and should thus pass everywhere.
Fixes: #7267
- Remove the uaccess tag from /dev/dri/renderD*.
- Change the owning group from video to render.
- Change default mode to 0666.
- Add an option to allow users to set the access mode for these devices at
compile time.
The linux umount2() systemcall accepts a MNT_FORCE flags
which some filesystems honor, particularly FUSE and various
network filesystems such as NFS.
These filesystems can sometimes wait for an indefinite period
for a response from an external service, and the wait if
sometimes "uninterruptible" meaning that the process cannot be
killed.
Using MNT_FORCE causes any such request that are outstanding to
be aborted. This normally allows the waiting process to
be killed. It will then realease and reference it has to the
filesytem, this allowing the filesystem to be unmounted.
If there remain active references to the filesystem, MNT_FORCE
is *not* forcefull enough to unmount the filesystem anyway.
By the time that umount_all() is run by systemd-shutdown, all
filesystems *should* be unmounted, and sync() will have been
called. Anything that remains cannot be unmounted in a
completely clean manner and just nees to be dealt with as firmly
as possible. So use MNT_FORCE and try to explain why in the
comment.
Also enhance an earlier comment to explain why umount2() is
safe even though mount(MNT_REMOUNT) isn't.
RuntimeDirectory= often used for sharing files or sockets with other
services. So, if creating them under private/ sub-directory, we cannot
set DynamicUser= to service units which want to share something through
RuntimeDirectory=.
This makes the directories given by RuntimeDirectory= are created under
/run/ even if DynamicUser= is set.
Fixes#7260.
When at least one of BindPaths=, BindReadOnlyPaths=, RootImage=,
RuntimeDirectory= or their friends are set, systemd prepares
a namespace under /run/systemd/unit-root. Thus, ReadWritePaths=
or their friends without '+' prefix is completely meaningless.
So, let's assume '+' prefix when one of them are set.
Fixes#7070 and #7080.
job_finish_and_invalidate() calls job_free() to destroy jobs (and remove
them from the dbus queue). So we don't need to add them to the dbus queue
first.
We only want to add jobs to the dbus queue if they're a restart job, which
we're transmogrifying into a start job and putting back into the system.
During startup of networkd we try to drop the configs. While droping
routes we filling ip route type and because of which message like
```
host: Could not drop route: Invalid argument
host: Could not drop route: Invalid argument
```
are shown.
Closed#6929
_unused_ means "the variable is meant to be possible unused and gcc
will not generate a warning about it", which is exactly what we need here,
since we're only declaring it for the side effect of _cleanup_.
clang warns about a few sites like this:
../src/journal/journal-file.c:1780:48: warning: taking address of packed member 'entry_offset' of class or structure 'DataObject' may result in an unaligned pointer value [-Waddress-of-packed-member]
&o->data.entry_offset,
^~~~~~~~~~~~~~~~~~~~
but DataObject.entry_offset will always be 8-byte aligned as long as
the DataObject structure is aligned. Similarly in other cases, the
field is always aligned. Let's just silence the warning to avoid noise.
gcc does not know -Waddress-of-packed-member, and would warn about an unknown
warning, so we need to conditionalize on __clang__.
../src/network/networkd-link.c:3577:84: warning: format specifies type 'unsigned char' but the argument has type 'uint32_t' (aka 'unsigned int') [-Wformat]
route->dst_prefixlen, route->tos, route->priority, route->table, route->lifetime);
^~~~~~~~~~~~
../src/network/networkd-manager.c:1146:132: warning: format specifies type 'unsigned char' but the argument has type 'uint32_t' (aka 'unsigned int') [-Wformat]
rule->from_prefixlen, space ? " " : "", to_str, rule->to_prefixlen, rule->tos, rule->fwmark, rule->fwmask, rule->table);
^~~~~~~~~~~
Also add some line breaks to make it easier to see which argument is for which
part of the format string.
clang warns:
../src/import/importd.c:254:70: warning: 'break' is bound to current loop, GCC binds it to the enclosing loop [-Wgcc-compat]
while ((e < t->log_message + t->log_message_size) && IN_SET(*e, 0, '\n'))
^
Let's just play it safe and not use IN_SET here.
../src/journal/journald-native.c:341:13: warning: variable 'context' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
if (ucred && pid_is_valid(ucred->pid)) {
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../src/journal/journald-native.c:350:42: note: uninitialized use occurs here
context, ucred, tv, label, label_len);
^~~~~~~
../src/journal/journald-native.c:335:31: note: initialize the variable 'context' to silence this warning
ClientContext *context;
^
= NULL
Very nice reporting!
Functions that we call can handle context == NULL, so it's enough to simply
initialize the variable.
Failure to spawn ExecStartPost was being handled differently to e.g.
EXIT_FAILURE returned by ExecStartPost. It looks like this was an
oversight. Fix to match documented behaviour.
`man systemd.service`:
> Note that if any of the commands specified in ExecStartPre=, ExecStart=,
> or ExecStartPost= fail (and are not prefixed with "-", see above) or time
> out before the service is fully up, execution continues with commands
> specified in ExecStopPost=, the commands in ExecStop= are skipped.
Update the timeout warnings for remount and unmount. For consistency with
mount, for accuracy, and for consistency with their equivalents in
service.c.
manager_connect_bus() is called *before* manager_coldplug(). As a last
thing in service_coldplug() we set service state to
s->deserialized_state, and thus before we do that all services are
inactive and try_connect always evaluates to false. To fix that we must
look at deserialized state instead of current unit state.
Fixes#7146
Using `kill()` with a signal of 0 is a slightly more documented idiom for
checking whether a process still exists. It is mentioned explicitly in
man pages. This avoids the need to comment the call as "misuse".
A comment is still necessary - in fact this idiom is even more confusing if
you don't know how it works. But it's easy enough to explain.
This option allows restricting the shown fields in the output modes that
would normally show all fields. It allows clients that are only
interested in a subset of the fields to access those more efficiently.
Also, it makes the resulting size of the output more predictable.
It has no effect on the various `short` output modes, because those
already only show a subset of the fields.
This augments %t which already resolves to the runtime directory root, and
should be useful for units that want to pass any of these paths in
command line arguments.
Example:
ExecStart=/usr/bin/mydaemon --datadir=%S/mydaemon
Why not expose a specifier resolving directly to the configured
state/runtime/cache/log dir? Three reasons:
1. Specifiers should be independent of configuration of the unit itself,
and StateDirectory= and friends are unit configuration. See
03fc9c723c and related work.
2. We permit multiple StateDirectory= values per unit, and it hence
wouldn't be clear which one is passed.
3. We already have %t for the runtime directory root, and we should
continue with the same scheme.
This adds some simply detection logic for cases where dissection is
invoked on an externally created loop device, and partitions have been
detected on it, but partition scanning so far was off. If this is
detected we now print a brief message indicating what the issue is,
instead of failing with a useless EINVAL message the kernel passed to
us.
This adds some basic discovery of block device images for nspawn and
friends. Note that this doesn't add searching for block devices using
udev, but instead expects users to symlink relevant block devices into
/var/lib/machines. Discovery is hence done exactly like for
dir/subvol/raw file images, except that what is found may be a (symlink
to) a block device.
For now, we do not support cloning these images, but removal, renaming
and read-only flags are supported to the point where that makes sense.
Fixe: #6990
If we follow an absolute symlink there's no need to prefix the path with
a "/", since by definition it already has one.
This helps suppressing double "/" in resolved paths containing absolute
symlinks.
Some of the btrfs utility functions already used O_NOFOLLOW others
didn't. Let's streamline this, and refuse operation when we are called
for symlinks on "remove" and "snapshot" too.
In particular in the "remove" case following symlinks is a bad idea, and
is quite different from how unlink() and friends work, which always
remove the symlink, and not the destination, a logic we should follow
here too.
This fixes --read-only with --private-users. mkdir_userns_p may return
-EROFS if either mkdir or lchown fails; lchown failing is fine as the
mount point will just be overmounted, and if mkdir fails then the
following mount() will also fail (with ENOENT).
After previous output from systemd-shutdown indicated a bug, my attention
was drawn to redundant output lines. Did they indicate an anomaly?
It turns out to be an expected, harmless result of the current code. But
we don't have much justification to run such redundant operations. Let's
remove the confusing redundant message.
We can stop trying to remount a directory read-only once its mount entry
has successfully been changed to "ro". We can simply let the kernel keep
track of this for us. I don't bother to try and avoid re-parsing the
mountinfo. I appreciate snappy shutdowns, but this code is already
intricate and buggy enough (see issue 7131).
(Disclaimer: At least for the moment, you can't _rely_ on always seeing
suspicious output from systemd-shutdown. By default, you can expect the
kernel to truncate the log output of systemd-shutdown. Ick ick ick!
Because /dev/kmsg is rate-limited by default. Normally it prints a message
"X lines supressed", but we tend to shut down before the timer expires
in this case).
Before:
systemd-shutdown[1]: Remounting '/' read-only with options 'seclabel...
EXT4-fs (vda3): re-mounted. Opts: data=ordered
systemd-shutdown[1]: Remounting '/' read-only with options 'seclabel, ...
EXT4-fs (vda3): re-mounted. Opts: data=ordered
After:
systemd-shutdown[1]: Remounting '/' read-only with options 'seclabel, ...
EXT4-fs (vda3): re-mounted. Opts: data=ordered
I also tested with `systemctl reboot --force`, plus a loopback mount to
cause one of the umounts to fail initially. In this case another 2 lines
of output are removed (out of a larger number of lines).
This creates a second private resolve.conf file which lists the stub resolver
and the resolved acquired search domains.
This runtime file should be used as a symlink target for /etc/resolv.conf such
that non-nss based applications can resolve search domains.
Fixes: #7009
When a user logs in, systemd-pam will wait for the user manager instance to
report readiness. We don't need to wait for all the jobs to finish, it
is enough if the basic startup is done and the user manager is responsive.
systemd --user will now send out a READY=1 notification when either of two
conditions becomes true:
- basic.target/start job is gone,
- the initial transaction is done.
Also fixes#2863.
The current code shifting an integer 1 failed for capabilities like
CAP_MAC_ADMIN (numerical value 33). This caused issues when specifying
them in the nspawn configuration file. Using an uint64_t 1 instead.
The similar code for processing the --capability command line option
was already correctly working.
There's a slight change in implementation: we first try to append the
version, then look for any non-unique pairs again. Before, we would only
mark as possibly unique those entries we changed. But if there are two
entries that e.g. have the same title and version, but only one has the
machine-id specified, we would treat one of them as still non-unique after
appending the machine-id to the other one. So the new algorithm is simpler
but more robust (not that it matters).
The assumption was that nothing changes in the final attempt. This
would be confusing if a filesystem with a process in uninterruptible
sleep suddenly became un-stuck for the final attempt, but we still give
up and don't try to e.g. unmount any parent mounts.
I don't know how possible that is. But the code will be easier to read
without an assumption that it does not attempt to justify.
scope_abandon() called unit_watch_all_pids() to check if there are any pids in
the cgroup, but unit_watch_all_pids() does nothing (and returns an empty set of
pids) when cg_unified_controller(SYSTEMD) returns true. On hybrid or unified,
cg_unified_controller(SYSTEMD) returns 1, so scope_abandon() thinks the scope
is empty, even though it's not, and marks the scope dead prematurely.
Example output after the scope is marked dead with processes still being present:
● session-24.scope - Session 24 of user guest
Loaded: loaded (/run/systemd/transient/session-24.scope; transient; vendor preset: disabled)
Transient: yes
Active: inactive (dead) since Sat 2017-10-14 15:36:22 CEST; 5min ago
Tasks: 1
CGroup: /user.slice/user-1001.slice/session-24.scope
└─17309 sleep infinity
Subsequent calls to stop the scope unit do nothing, because systemd thinks the
scope is already dead.
https://bugzilla.redhat.com/show_bug.cgi?id=1486859
This is easily reproducible on both F26 and F27:
> ssh guest@localhost
$ pulseaudio & sleep infinity & disown; exit
$ systemctl status $$
> systemctl status session-NN.scope
Tested with unified, hybrid, and legacy layouts, seems to work.
In https://bugzilla.redhat.com/show_bug.cgi?id=1486859 error messages appera:
Sep 06 19:09:07 ld92.e.math.uh.edu audit[21482]: AVC avc: denied { read } for pid=21482 comm="systemd-logind" name="dbus-1" dev="tmpfs" ino=5548194 scontext=system_u:system_r:systemd_logind_t:s0 tcontext=unconfined_u:object_r:session_dbusd_tmp_t:s0 tclass=dir permissive=0
Sep 06 19:09:07 ld92.e.math.uh.edu systemd-logind[21482]: Failed to remove runtime directory /run/user/8664: Permission denied
But it's not clear which of the two rm_rf's is the source. Let's make
them different.