This modernizes acquire_terminal() in a couple of ways:
1. The three boolean arguments are replaced by a flags parameter, that
should be more descriptive in what it does.
2. We now properly handle inotify queue overruns
3. We use _cleanup_ for closing the fds now, to shorten the code quite a
bit.
Behaviour should not be altered by this.
Before this, each ExecRuntime object is owned by a unit. However,
it may be shared with other units which enable JoinsNamespaceOf=.
Thus, by the serialization/deserialization process, its sharing
information, more specifically, reference counter is lost, and
causes issue #7790.
This makes ExecRuntime objects be managed by manager, and changes
the serialization/deserialization process.
Fixes#7790.
Using wait_for_terminate_and_check() instead of wait_for_terminate()
let's us simplify, shorten and unify the return value checking and
logging of waitid(). Hence, let's use it all over the place.
This adds a new safe_fork() wrapper around fork() and makes use of it
everywhere. The new wrapper does a couple of things we previously did
manually and separately in a safer, more correct and automatic way:
1. Optionally resets signal handlers/mask in the child
2. Sets a name on all processes we fork off right after forking off (and
the patch assigns useful names for all processes we fork off now,
following a systematic naming scheme: always enclosed in () – in order
to indicate that these are not proper, exec()ed processes, but only
forked off children, and if the process is long-running with only our
own code, without execve()'ing something else, it gets am "sd-" prefix.)
3. Optionally closes all file descriptors in the child
4. Optionally sets a PR_SET_DEATHSIG to SIGTERM in the child, in a safe
way so that the parent dying before this happens being handled
safely.
5. Optionally reopens the logs
6. Optionally connects stdin/stdout/stderr to /dev/null
7. Debug logs about the forked off processes.
This makes things a bit easier to read I think, and also makes sure we
always use the _unlikely_ wrapper around it, which so far we used
sometimes and other times we didn't. Let's clean that up.
Our CODING_STYLE suggests not comparing with NULL, but relying on C's
downgrade-to-bool feature for that. Fix up some code to match these
guidelines. (This is not comprehensive, the coccinelle output for this
is unfortunately kinda borked)
This suppresses the following warning
```
execute.c:2149:12: warning: ‘setup_smack’ defined but not used [-Wunused-function]
static int setup_smack(
^~~~~~~~~~~
```
This makes sure we migrate /var/lib/<foo> if it exists to
/var/lib/private/<foo> if DynamicUser=1 is set. This is useful to allow
turning on DynamicUser= on services that previously didn't use it, and
we can deal with this, and migrate the relevant directories as
necessary.
Note that "downgrading" from DynamicUser=1 backto DynamicUser=0 works
too. However in that case we simply continue to use
/var/lib/private/<foo>, which works because /var/lib/<foo> is a symlink
there after all.
We never use these functions seperately, hence don't bother splitting
them into to.
Also, simplify things a bit, and maintain tables for the attribute files
to chown. Let's also update those tables a bit, and include thenew
"cgroup.threads" file in it, that needs to be delegated too, according
to the documentation.
Smack LSM needs the capability CAP_MAC_ADMIN to allow
setting of the current Smack exec label. Consequently,
dropping capabilities must be done after changing the
current exec label.
This is only related to Smack LSM. But for clarity and
regularity, all setting of security context moved before
dropping capabilities.
See Issue 7108
These new settings permit specifiying arbitrary paths as
stdin/stdout/stderr locations. We try to open/create them as necessary.
Some special magic is applied:
1) if the same path is specified for both input and output/stderr, we'll
open it only once O_RDWR, and duplicate them fd instead.
2) If we an AF_UNIX socket path is specified, we'll connect() to it,
rather than open() it. This allows invoking systemd services with
stdin/stdout/stderr connected to arbitrary foreign service sockets.
Fixes: #3991
property_get_output_fdname() already had two different control flows for
stdout and stderr, it might as well handle stdin too, thus shortening
our code a bit.
Both permit configuring data to pass through STDIN to an invoked
process. StandardInputText= accepts a line of text (possibly with
embedded C-style escapes as well as unit specifiers), which is appended
to the buffer to pass as stdin, followed by a single newline.
StandardInputData= is similar, but accepts arbitrary base64 encoded
data, and will not resolve specifiers or C-style escapes, nor append
newlines.
This may be used to pass input/configuration data to services, directly
in-line from unit files, either in a cooked or in a more raw format.
Given that Linux assigns the same ioctl numbers ot multiple subsystems,
we should be careful when invoking ioctls, so that we don't end up
calling something we wouldn't want to call.
We are using the same pattern at various places: call dup2() on an fd,
and close the old fd, usually in combination with some O_CLOEXEC
fiddling. Let's add a little helper for this, and port a few obvious
cases over.
And let's make use of it to implement two new unit settings with it:
1. LogLevelMax= is a new per-unit setting that may be used to configure
log priority filtering: set it to LogLevelMax=notice and only
messages of level "notice" and lower (i.e. more important) will be
processed, all others are dropped.
2. LogExtraFields= is a new per-unit setting for configuring per-unit
journal fields, that are implicitly included in every log record
generated by the unit's processes. It takes field/value pairs in the
form of FOO=BAR.
Also, related to this, one exisiting unit setting is ported to this new
facility:
3. The invocation ID is now pulled from /run/systemd/units/ instead of
cgroupfs xattrs. This substantially relaxes requirements of systemd
on the kernel version and the privileges it runs with (specifically,
cgroupfs xattrs are not available in containers, since they are
stored in kernel memory, and hence are unsafe to permit to lesser
privileged code).
/run/systemd/units/ is a new directory, which contains a number of files
and symlinks encoding the above information. PID 1 creates and manages
these files, and journald reads them from there.
Note that this is supposed to be a direct path between PID 1 and the
journal only, due to the special runtime environment the journal runs
in. Normally, today we shouldn't introduce new interfaces that (mis-)use
a file system as IPC framework, and instead just an IPC system, but this
is very hard to do between the journal and PID 1, as long as the IPC
system is a subject PID 1 manages, and itself a client to the journal.
This patch cleans up a couple of types used in journal code:
specifically we switch to size_t for a couple of memory-sizing values,
as size_t is the right choice for everything that is memory.
Fixes: #4089Fixes: #3041Fixes: #4441
This makes each system call in SystemCallFilter= blacklist optionally
takes errno name or number after a colon. The errno takes precedence
over the one given by SystemCallErrorNumber=.
C.f. #7173.
Closes#7169.
RuntimeDirectory= often used for sharing files or sockets with other
services. So, if creating them under private/ sub-directory, we cannot
set DynamicUser= to service units which want to share something through
RuntimeDirectory=.
This makes the directories given by RuntimeDirectory= are created under
/run/ even if DynamicUser= is set.
Fixes#7260.
The error message corresponds to EILSEQ is "Invalid or incomplete
multibyte or wide character", and is not suitable in this case.
So, let's show a custom error message when the function
dynamic_creds_realize() returns -EILSEQ.
We generally use the casing "Namespace" for the word, and that's visible
in a number of user-facing interfaces, including "RestrictNamespace=" or
"JoinsNamespaceOf=". Let's make sure to use the same casing internally
too.
As discussed in #7024
The advantage is that is the name is mispellt, cpp will warn us.
$ git grep -Ee "conf.set\('(HAVE|ENABLE)_" -l|xargs sed -r -i "s/conf.set\('(HAVE|ENABLE)_/conf.set10('\1_/"
$ git grep -Ee '#ifn?def (HAVE|ENABLE)' -l|xargs sed -r -i 's/#ifdef (HAVE|ENABLE)/#if \1/; s/#ifndef (HAVE|ENABLE)/#if ! \1/;'
$ git grep -Ee 'if.*defined\(HAVE' -l|xargs sed -i -r 's/defined\((HAVE_[A-Z0-9_]*)\)/\1/g'
$ git grep -Ee 'if.*defined\(ENABLE' -l|xargs sed -i -r 's/defined\((ENABLE_[A-Z0-9_]*)\)/\1/g'
+ manual changes to meson.build
squash! build-sys: use #if Y instead of #ifdef Y everywhere
v2:
- fix incorrect setting of HAVE_LIBIDN2
Let's optimize dynamic UID allocation a bit: if a StateDirectory= (or
suchlike) is configured, we start our allocation loop from that UID and
use it if it currently isn't used otherwise. This is beneficial as it
saves us from having to expensively recursively chown() these
directories in the typical case (which StateDirectory= does when it
notices that the owner of the directory doesn't match the UID picked).
With this in place we now have the a three-phase logic for allocating a
dynamic UID:
a) first, we try to use the owning UID of StateDirectory=,
CacheDirectory=, LogDirectory= if that exists and is currently
otherwise unused.
b) if that didn't work out, we hash the UID from the service name
c) if that didn't yield an unused UID either, randomly pick new ones
until we find a free one.
Let's clean up the interaction of StateDirectory= (and friends) to
DynamicUser=1: instead of creating these directories directly below
/var/lib, place them in /var/lib/private instead if DynamicUser=1 is
set, making that directory 0700 and owned by root:root. This way, if a
dynamic UID is later reused, access to the old run's state directory is
prohibited for that user. Then, use file system namespacing inside the
service to make /var/lib/private a readable tmpfs, hiding all state
directories that are not listed in StateDirectory=, and making access to
the actual state directory possible. Mount all directories listed in
StateDirectory= to the same places inside the service (which means
they'll now be mounted into the tmpfs instance). Finally, add a symlink
from the state directory name in /var/lib/ to the one in
/var/lib/private, so that both the host and the service can access the
path under the same location.
Here's an example: let's say a service runs with StateDirectory=foo.
When DynamicUser=0 is set, it will get the following setup, and no
difference between what the unit and what the host sees:
/var/lib/foo (created as directory)
Now, if DynamicUser=1 is set, we'll instead get this on the host:
/var/lib/private (created as directory with mode 0700, root:root)
/var/lib/private/foo (created as directory)
/var/lib/foo → private/foo (created as symlink)
And from inside the unit:
/var/lib/private (a tmpfs mount with mode 0755, root:root)
/var/lib/private/foo (bind mounted from the host)
/var/lib/foo → private/foo (the same symlink as above)
This takes inspiration from how container trees are protected below
/var/lib/machines: they generally reuse UIDs/GIDs of the host, but
because /var/lib/machines itself is set to 0700 host users cannot access
files in the container tree even if the UIDs/GIDs are reused. However,
for this commit we add one further trick: inside and outside of the unit
/var/lib/private is a different thing: outside it is a plain,
inaccessible directory, and inside it is a world-readable tmpfs mount
with only the whitelisted subdirs below it, bind mounte din. This
means, from the outside the dir acts as an access barrier, but from the
inside it does not. And the symlink created in /var/lib/foo itself
points across the barrier in both cases, so that root and the unit's
user always have access to these dirs without knowing the details of
this mounting magic.
This logic resolves a major shortcoming of DynamicUser=1 units:
previously they couldn't safely store persistant data. With this change
they can have their own private state, log and data directories, which
they can write to, but which are protected from UID recycling.
With this change, if RootDirectory= or RootImage= are used it is ensured
that the specified state/log/cache directories are always mounted in
from the host. This change of semantics I think is much preferable since
this means the root directory/image logic can be used easily for
read-only resource bundling (as all writable data resides outside of the
image). Note that this is a change of behaviour, but given that we
haven't released any systemd version with StateDirectory= and friends
implemented this should be a safe change to make (in particular as
previously it wasn't clear what would actually happen when used in
combination). Moreover, by making this change we can later add a "+"
modifier to these setings too working similar to the same modifier in
ReadOnlyPaths= and friends, making specified paths relative to the
container itself.
In most cases we followed the rule that the special _INVALID and _MAX
values we use in our enums use the full type name as prefix (in contrast
to regular values that we often make shorter), do so for
ExecDirectoryType as well.
No functional changes, just a little bit of renaming to make this code
more like the rest.
This is particularly useful when used in conjunction with DynamicUser=1,
where the UID might change for every invocation, but is useful in other
cases too, for example, when these directories are shared between
systems where the UID assignments differ slightly.
Just in case something opened them, let's make sure glibc invalidates
them too.
Thankfully so far no library opened log channels behind our back, at
least as far as I know, hence this is actually a NOP, but let's better
be safe than sorry.
Now that logging can implicitly reopen the log streams when needed we
can log errors without any special magic, hence let's normalize things,
and log the same way we do everywhere else.
This adds IOVEC_INIT() and IOVEC_MAKE() for initializing iovec structures
from a pointer and a size. On top of these IOVEC_INIT_STRING() and
IOVEC_MAKE_STRING() are added which take a string and automatically
determine the size of the string using strlen().
This patch removes the old IOVEC_SET_STRING() macro, given that
IOVEC_MAKE_STRING() is now useful for similar purposes. Note that the
old IOVEC_SET_STRING() invocations were two characters shorter than the
new ones using IOVEC_MAKE_STRING(), but I think the new syntax is more
readable and more generic as it simply resolves to a C99 literal
structure initialization. Moreover, we can use very similar syntax now
for initializing strings and pointer+size iovec entries. We canalso use
the new macros to initialize function parameters on-the-fly or array
definitions. And given that we shouldn't have so many ways to do the
same stuff, let's just settle on the new macros.
(This also converts some code to use _cleanup_ where dynamically
allocated strings were using IOVEC_SET_STRING() before, to modernize
things a bit)
Usually, it's a good thing that we isolate the kernel session keyring
for the various services and disconnect them from the user keyring.
However, in case of the cryptsetup key caching we actually want that
multiple instances of the cryptsetup service can share the keys in the
root user's user keyring, hence we need to be able to disable this logic
for them.
This adds KeyringMode=inherit|private|shared:
inherit: don't do any keyring magic (this is the default in systemd --user)
private: a private keyring as before (default in systemd --system)
shared: the new setting
If two separate log streams are connected to stdout and stderr, let's
make sure $JOURNAL_STREAM points to the latter, as that's the preferred
log destination, and the environment variable has been created in order
to permit services to automatically upgrade from stderr based logging to
native journal logging.
Also, document this behaviour.
Fixes: #6800
With this setting we can explicitly unset specific variables for
processes of a unit, as last step of assembling the environment block
for them. This is useful to fix#6407.
While we are at it, greatly expand the documentation on how the
environment block for forked off processes is assembled.
glibc appears to propagate different errors in different ways, let's fix
this up, so that our own code doesn't get confused by this.
See #6752 + #6737 for details.
Fixes: #6755
Let's clean up the checking for the various ExecOutput values a bit,
let's use IN_SET everywhere, and the same concepts for all three bools
we pass to dprintf().
Let's lock the personality to the currently set one, if nothing is
specifically specified. But do so with a grain of salt, and never
default to any exotic personality here, but only PER_LINUX or
PER_LINUX32.
Add LockPersonality boolean to allow locking down personality(2)
system call so that the execution domain can't be changed.
This may be useful to improve security because odd emulations
may be poorly tested and source of vulnerabilities, while
system services shouldn't need any weird personalities.
This patch adds two new special character prefixes to ExecStart= and
friends, in addition to the existing "-", "@" and "+":
"!" → much like "+", except with a much reduced effect as it only
disables the actual setresuid()/setresgid()/setgroups() calls, but
leaves all other security features on, including namespace
options. This is very useful in combination with
RuntimeDirectory= or DynamicUser= and similar option, as a user
is still allocated and used for the runtime directory, but the
actual UID/GID dropping is left to the daemon process itself.
This should make RuntimeDirectory= a lot more useful for daemons
which insist on doing their own privilege dropping.
"!!" → Similar to "!", but on systems supporting ambient caps this
becomes a NOP. This makes it relatively straightforward to write
unit files that make use of ambient capabilities to let systemd
drop all privs while retaining compatibility with systems that
lack ambient caps, where priv dropping is the left to the daemon
codes themselves.
This is an alternative approach to #6564 and related PRs.
These booleans simply store whether selinux/apparmor/smack are supposed
ot be used, and chache the various mac_xyz_use() calls before we
transition into the namespace, hence let's use the same verb for the
variables and the functions: "use"
Let's merge three if blocks that shall only run when sandboxing is applied
into one.
Note that this changes behaviour in one corner case: PrivateUsers=1 is
now honours both PermissionsStartOnly= and the "+" modifier in
ExecStart=, and not just the former, as before. This was an oversight,
so let's fix this now, at a point in time the option isn't used much
yet.
"Permissions" was a bit of a misnomer, as it suggests that UNIX file
permission bits are adjusted, which aren't really changed here. Instead,
this is about UNIX credentials such as users or groups, as well as
namespacing, hence let's use a more generic term here, without any
misleading reference to UNIX file permissions: "sandboxing", which shall
refer to all kinds of sandboxing technologies, including UID/GID
dropping, selinux relabelling, namespacing, seccomp, and so on.
Let's decouple the Manager object from the execution logic a bit more
here too, and simply pass along the fact whether we should
unconditionally chown the runtime/... directories via the ExecFlags
field too.
Let's try to decouple the execution engine a bit from the Unit/Manager
concept, and hence pass one more flag as part of the ExecParameters flags
field.
This new helper removes a leading /dev if there is one. We have code
doing this all over the place, let's unify this, and correct it while
we are at it, by using path_startswith() rather than startswith() to
drop the prefix.
/sys is not guaranteed to exist when a new mount namespace is created.
It is only mounted under conditions specified by
`namespace_info_mount_apivfs`.
Checking if the three available MAC LSMs are enabled requires a sysfs
mounted at /sys, so the checks are moved to before a new mount ns is
created.
When we create a log stream connection to journald, we pass along the
unit ID. With this change we do this only when we run as system
instance, not as user instance, to remove the ambiguity whether a user
or system unit is specified. The effect of this change is minor:
journald ignores the field anyway from clients with UID != 0. This patch
hence only fixes the unit attribution for the --user instance of the
root user.
This moves pretty much all uses of getpid() over to getpid_raw(). I
didn't specifically check whether the optimization is worth it for each
replacement, but in order to keep things simple and systematic I
switched over everything at once.
This introduces {State,Cache,Log,Configuration}Directory= those are
similar to RuntimeDirectory=. They create the directories under
/var/lib, /var/cache/, /var/log, or /etc, respectively, with the mode
specified in {State,Cache,Log,Configuration}DirectoryMode=.
This also fixes#6391.