The advantage is that is the name is mispellt, cpp will warn us.
$ git grep -Ee "conf.set\('(HAVE|ENABLE)_" -l|xargs sed -r -i "s/conf.set\('(HAVE|ENABLE)_/conf.set10('\1_/"
$ git grep -Ee '#ifn?def (HAVE|ENABLE)' -l|xargs sed -r -i 's/#ifdef (HAVE|ENABLE)/#if \1/; s/#ifndef (HAVE|ENABLE)/#if ! \1/;'
$ git grep -Ee 'if.*defined\(HAVE' -l|xargs sed -i -r 's/defined\((HAVE_[A-Z0-9_]*)\)/\1/g'
$ git grep -Ee 'if.*defined\(ENABLE' -l|xargs sed -i -r 's/defined\((ENABLE_[A-Z0-9_]*)\)/\1/g'
+ manual changes to meson.build
squash! build-sys: use #if Y instead of #ifdef Y everywhere
v2:
- fix incorrect setting of HAVE_LIBIDN2
Let's optimize dynamic UID allocation a bit: if a StateDirectory= (or
suchlike) is configured, we start our allocation loop from that UID and
use it if it currently isn't used otherwise. This is beneficial as it
saves us from having to expensively recursively chown() these
directories in the typical case (which StateDirectory= does when it
notices that the owner of the directory doesn't match the UID picked).
With this in place we now have the a three-phase logic for allocating a
dynamic UID:
a) first, we try to use the owning UID of StateDirectory=,
CacheDirectory=, LogDirectory= if that exists and is currently
otherwise unused.
b) if that didn't work out, we hash the UID from the service name
c) if that didn't yield an unused UID either, randomly pick new ones
until we find a free one.
Let's clean up the interaction of StateDirectory= (and friends) to
DynamicUser=1: instead of creating these directories directly below
/var/lib, place them in /var/lib/private instead if DynamicUser=1 is
set, making that directory 0700 and owned by root:root. This way, if a
dynamic UID is later reused, access to the old run's state directory is
prohibited for that user. Then, use file system namespacing inside the
service to make /var/lib/private a readable tmpfs, hiding all state
directories that are not listed in StateDirectory=, and making access to
the actual state directory possible. Mount all directories listed in
StateDirectory= to the same places inside the service (which means
they'll now be mounted into the tmpfs instance). Finally, add a symlink
from the state directory name in /var/lib/ to the one in
/var/lib/private, so that both the host and the service can access the
path under the same location.
Here's an example: let's say a service runs with StateDirectory=foo.
When DynamicUser=0 is set, it will get the following setup, and no
difference between what the unit and what the host sees:
/var/lib/foo (created as directory)
Now, if DynamicUser=1 is set, we'll instead get this on the host:
/var/lib/private (created as directory with mode 0700, root:root)
/var/lib/private/foo (created as directory)
/var/lib/foo → private/foo (created as symlink)
And from inside the unit:
/var/lib/private (a tmpfs mount with mode 0755, root:root)
/var/lib/private/foo (bind mounted from the host)
/var/lib/foo → private/foo (the same symlink as above)
This takes inspiration from how container trees are protected below
/var/lib/machines: they generally reuse UIDs/GIDs of the host, but
because /var/lib/machines itself is set to 0700 host users cannot access
files in the container tree even if the UIDs/GIDs are reused. However,
for this commit we add one further trick: inside and outside of the unit
/var/lib/private is a different thing: outside it is a plain,
inaccessible directory, and inside it is a world-readable tmpfs mount
with only the whitelisted subdirs below it, bind mounte din. This
means, from the outside the dir acts as an access barrier, but from the
inside it does not. And the symlink created in /var/lib/foo itself
points across the barrier in both cases, so that root and the unit's
user always have access to these dirs without knowing the details of
this mounting magic.
This logic resolves a major shortcoming of DynamicUser=1 units:
previously they couldn't safely store persistant data. With this change
they can have their own private state, log and data directories, which
they can write to, but which are protected from UID recycling.
With this change, if RootDirectory= or RootImage= are used it is ensured
that the specified state/log/cache directories are always mounted in
from the host. This change of semantics I think is much preferable since
this means the root directory/image logic can be used easily for
read-only resource bundling (as all writable data resides outside of the
image). Note that this is a change of behaviour, but given that we
haven't released any systemd version with StateDirectory= and friends
implemented this should be a safe change to make (in particular as
previously it wasn't clear what would actually happen when used in
combination). Moreover, by making this change we can later add a "+"
modifier to these setings too working similar to the same modifier in
ReadOnlyPaths= and friends, making specified paths relative to the
container itself.
In most cases we followed the rule that the special _INVALID and _MAX
values we use in our enums use the full type name as prefix (in contrast
to regular values that we often make shorter), do so for
ExecDirectoryType as well.
No functional changes, just a little bit of renaming to make this code
more like the rest.
This is particularly useful when used in conjunction with DynamicUser=1,
where the UID might change for every invocation, but is useful in other
cases too, for example, when these directories are shared between
systems where the UID assignments differ slightly.
Just in case something opened them, let's make sure glibc invalidates
them too.
Thankfully so far no library opened log channels behind our back, at
least as far as I know, hence this is actually a NOP, but let's better
be safe than sorry.
Now that logging can implicitly reopen the log streams when needed we
can log errors without any special magic, hence let's normalize things,
and log the same way we do everywhere else.
This adds IOVEC_INIT() and IOVEC_MAKE() for initializing iovec structures
from a pointer and a size. On top of these IOVEC_INIT_STRING() and
IOVEC_MAKE_STRING() are added which take a string and automatically
determine the size of the string using strlen().
This patch removes the old IOVEC_SET_STRING() macro, given that
IOVEC_MAKE_STRING() is now useful for similar purposes. Note that the
old IOVEC_SET_STRING() invocations were two characters shorter than the
new ones using IOVEC_MAKE_STRING(), but I think the new syntax is more
readable and more generic as it simply resolves to a C99 literal
structure initialization. Moreover, we can use very similar syntax now
for initializing strings and pointer+size iovec entries. We canalso use
the new macros to initialize function parameters on-the-fly or array
definitions. And given that we shouldn't have so many ways to do the
same stuff, let's just settle on the new macros.
(This also converts some code to use _cleanup_ where dynamically
allocated strings were using IOVEC_SET_STRING() before, to modernize
things a bit)
Usually, it's a good thing that we isolate the kernel session keyring
for the various services and disconnect them from the user keyring.
However, in case of the cryptsetup key caching we actually want that
multiple instances of the cryptsetup service can share the keys in the
root user's user keyring, hence we need to be able to disable this logic
for them.
This adds KeyringMode=inherit|private|shared:
inherit: don't do any keyring magic (this is the default in systemd --user)
private: a private keyring as before (default in systemd --system)
shared: the new setting
If two separate log streams are connected to stdout and stderr, let's
make sure $JOURNAL_STREAM points to the latter, as that's the preferred
log destination, and the environment variable has been created in order
to permit services to automatically upgrade from stderr based logging to
native journal logging.
Also, document this behaviour.
Fixes: #6800
With this setting we can explicitly unset specific variables for
processes of a unit, as last step of assembling the environment block
for them. This is useful to fix#6407.
While we are at it, greatly expand the documentation on how the
environment block for forked off processes is assembled.
glibc appears to propagate different errors in different ways, let's fix
this up, so that our own code doesn't get confused by this.
See #6752 + #6737 for details.
Fixes: #6755
Let's clean up the checking for the various ExecOutput values a bit,
let's use IN_SET everywhere, and the same concepts for all three bools
we pass to dprintf().
Let's lock the personality to the currently set one, if nothing is
specifically specified. But do so with a grain of salt, and never
default to any exotic personality here, but only PER_LINUX or
PER_LINUX32.
Add LockPersonality boolean to allow locking down personality(2)
system call so that the execution domain can't be changed.
This may be useful to improve security because odd emulations
may be poorly tested and source of vulnerabilities, while
system services shouldn't need any weird personalities.
This patch adds two new special character prefixes to ExecStart= and
friends, in addition to the existing "-", "@" and "+":
"!" → much like "+", except with a much reduced effect as it only
disables the actual setresuid()/setresgid()/setgroups() calls, but
leaves all other security features on, including namespace
options. This is very useful in combination with
RuntimeDirectory= or DynamicUser= and similar option, as a user
is still allocated and used for the runtime directory, but the
actual UID/GID dropping is left to the daemon process itself.
This should make RuntimeDirectory= a lot more useful for daemons
which insist on doing their own privilege dropping.
"!!" → Similar to "!", but on systems supporting ambient caps this
becomes a NOP. This makes it relatively straightforward to write
unit files that make use of ambient capabilities to let systemd
drop all privs while retaining compatibility with systems that
lack ambient caps, where priv dropping is the left to the daemon
codes themselves.
This is an alternative approach to #6564 and related PRs.
These booleans simply store whether selinux/apparmor/smack are supposed
ot be used, and chache the various mac_xyz_use() calls before we
transition into the namespace, hence let's use the same verb for the
variables and the functions: "use"
Let's merge three if blocks that shall only run when sandboxing is applied
into one.
Note that this changes behaviour in one corner case: PrivateUsers=1 is
now honours both PermissionsStartOnly= and the "+" modifier in
ExecStart=, and not just the former, as before. This was an oversight,
so let's fix this now, at a point in time the option isn't used much
yet.
"Permissions" was a bit of a misnomer, as it suggests that UNIX file
permission bits are adjusted, which aren't really changed here. Instead,
this is about UNIX credentials such as users or groups, as well as
namespacing, hence let's use a more generic term here, without any
misleading reference to UNIX file permissions: "sandboxing", which shall
refer to all kinds of sandboxing technologies, including UID/GID
dropping, selinux relabelling, namespacing, seccomp, and so on.
Let's decouple the Manager object from the execution logic a bit more
here too, and simply pass along the fact whether we should
unconditionally chown the runtime/... directories via the ExecFlags
field too.
Let's try to decouple the execution engine a bit from the Unit/Manager
concept, and hence pass one more flag as part of the ExecParameters flags
field.
This new helper removes a leading /dev if there is one. We have code
doing this all over the place, let's unify this, and correct it while
we are at it, by using path_startswith() rather than startswith() to
drop the prefix.
/sys is not guaranteed to exist when a new mount namespace is created.
It is only mounted under conditions specified by
`namespace_info_mount_apivfs`.
Checking if the three available MAC LSMs are enabled requires a sysfs
mounted at /sys, so the checks are moved to before a new mount ns is
created.
When we create a log stream connection to journald, we pass along the
unit ID. With this change we do this only when we run as system
instance, not as user instance, to remove the ambiguity whether a user
or system unit is specified. The effect of this change is minor:
journald ignores the field anyway from clients with UID != 0. This patch
hence only fixes the unit attribution for the --user instance of the
root user.
This moves pretty much all uses of getpid() over to getpid_raw(). I
didn't specifically check whether the optimization is worth it for each
replacement, but in order to keep things simple and systematic I
switched over everything at once.
This introduces {State,Cache,Log,Configuration}Directory= those are
similar to RuntimeDirectory=. They create the directories under
/var/lib, /var/cache/, /var/log, or /etc, respectively, with the mode
specified in {State,Cache,Log,Configuration}DirectoryMode=.
This also fixes#6391.
Commit 74dd6b515f (core: run each system
service with a fresh session keyring) broke adding keys to user keyring.
Added keys could not be accessed with error message:
keyctl_read_alloc: Permission denied
So link the user keyring to our session keyring.
This patch is a bit more complex thant I hoped. In particular the single
IOScheduling= property exposed on the bus is split up into
IOSchedulingClass= and IOSchedulingPriority= (though compat is
retained). Otherwise the asymmetry between setting props and getting
them is a bit too nasty.
Fixes#5613
'n_fds' field in the ExecParameters structure was counting the total number of
file descriptors to be passed to a unit.
This counter also includes the number of passed socket fds which is counted by
'n_socket_fds' already.
This patch removes that redundancy by replacing 'n_fds' with
'n_storage_fds'. The new field only counts the fds passed via the storage store
mechanism. That way each fd is counted at one place only.
Subsequently the patch makes sure to fix code that used 'n_fds' and also wanted
to iterate through all of them by explicitly adding 'n_socket_fds' + 'n_storage_fds'.
Suggested by Lennart.
Make sure to only apply the O_NONBLOCK flag to the fds passed via socket
activation.
Previously the flag was also applied to the fds which came from the fd store
but this was incorrect since services, after being restarted, expect that these
passed fds have their flags unchanged and can be reused as before.
The documentation was a bit unclear about this so clarify it.
Till now if the params->n_fds was 0, systemd was logging that there were
more than one sockets.
Thanks @gregoryp and @VFXcode who did the most work debugging this.
This doesn't really matter much, only in case somebody would use
something strange like
EnvironmentFile=/etc/something/.*
Make sure that "." and ".." is not returned by that glob. This makes
all our globbing patterns behave the same.
log_struct takes multiple format strings, each one followed by arguments.
The _printf_ annotation is not sufficiently flexible to express this,
but we can still annotate the first format string, though not its
arguments (because their number is unknown).
With the annotation, the places which specified the message id or similar
as the first pattern cause a warning from -Wformat-nonliteral. This can
be trivially fixed by putting the MESSAGE= first.
This change will help find issues where a non-literal is erroneously used
as the pattern.
The MountAPIVFS= documentation says that this options has no effect
unless used in conjunction with RootDirectory= or RootImage= ,lets fix
this and avoid to create private mount namespaces where it is not
needed.
Embedding sd_id128_t's in constant strings was rather cumbersome. We had
SD_ID128_CONST_STR which returned a const char[], but it had two problems:
- it wasn't possible to statically concatanate this array with a normal string
- gcc wasn't really able to optimize this, and generated code to perform the
"conversion" at runtime.
Because of this, even our own code in coredumpctl wasn't using
SD_ID128_CONST_STR.
Add a new macro to generate a constant string: SD_ID128_MAKE_STR.
It is not as elegant as SD_ID128_CONST_STR, because it requires a repetition
of the numbers, but in practice it is more convenient to use, and allows gcc
to generate smarter code:
$ size .libs/systemd{,-logind,-journald}{.old,}
text data bss dec hex filename
1265204 149564 4808 1419576 15a938 .libs/systemd.old
1260268 149564 4808 1414640 1595f0 .libs/systemd
246805 13852 209 260866 3fb02 .libs/systemd-logind.old
240973 13852 209 255034 3e43a .libs/systemd-logind
146839 4984 34 151857 25131 .libs/systemd-journald.old
146391 4984 34 151409 24f71 .libs/systemd-journald
It is also much easier to check if a certain binary uses a certain MESSAGE_ID:
$ strings .libs/systemd.old|grep MESSAGE_ID
MESSAGE_ID=%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x
MESSAGE_ID=%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x
MESSAGE_ID=%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x
MESSAGE_ID=%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x
$ strings .libs/systemd|grep MESSAGE_ID
MESSAGE_ID=c7a787079b354eaaa9e77b371893cd27
MESSAGE_ID=b07a249cd024414a82dd00cd181378ff
MESSAGE_ID=641257651c1b4ec9a8624d7a40a9e1e7
MESSAGE_ID=de5b426a63be47a7b6ac3eaac82e2f6f
MESSAGE_ID=d34d037fff1847e6ae669a370e694725
MESSAGE_ID=7d4958e842da4a758f6c1cdc7b36dcc5
MESSAGE_ID=1dee0369c7fc4736b7099b38ecb46ee7
MESSAGE_ID=39f53479d3a045ac8e11786248231fbf
MESSAGE_ID=be02cf6855d2428ba40df7e9d022f03d
MESSAGE_ID=7b05ebc668384222baa8881179cfda54
MESSAGE_ID=9d1aaa27d60140bd96365438aad20286
ReadOnlyPaths=, ProtectHome=, InaccessiblePaths= and ProtectSystem= are
about restricting access and little more, hence they should be disabled
if PermissionsStartOnly= is used or ExecStart= lines are prefixed with a
"+". Do that.
(Note that we will still create namespaces and stuff, since that's about
a lot more than just permissions. We'll simply disable the effect of
the four options mentioned above, but nothing else mount related.)
This also adds a test for this, to ensure this works as intended.
No documentation updates, as the documentation are already vague enough
to support the new behaviour ("If true, the permission-related execution
options…"). We could clarify this further, but I think we might want to
extend the switches' behaviour a bit more in future, hence leave it at
this for now.
Fixes: #5308
Or actually, try to to do the right thing depending on what is
available:
- If we know $HOME from User=, then use that.
- If the UID for the service is 0, hardcode that WorkingDirectory=~ means WorkingDirectory=/root
- In any other case (which will be the unprivileged --user case), use
get_home_dir() to find the $HOME of the user we are running as.
- Otherwise fail.
Fixes: #5246#5124
This is similar to RootDirectory= but mounts the root file system from a
block device or loopback file instead of another directory.
This reuses the image dissector code now used by nspawn and
gpt-auto-discovery.
This adds a boolean unit file setting MountAPIVFS=. If set, the three
main API VFS mounts will be mounted for the service. This only has an
effect on RootDirectory=, which it makes a ton times more useful.
(This is basically the /dev + /proc + /sys mounting code posted in the
original #4727, but rebased on current git, and with the automatic logic
replaced by explicit logic controlled by a unit file setting)
Before previous commit, username would be NULL for root, and set only
for other users. So the argument passed to utmp_put_init_process()
would be "root" for other users and NULL for root. Seems strange.
Instead, always pass the username if available.
This changes the environment for services running as root from:
LANG=C.utf8
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
INVOCATION_ID=ffbdec203c69499a9b83199333e31555
JOURNAL_STREAM=8:1614518
to
LANG=C.utf8
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
HOME=/root
LOGNAME=root
USER=root
SHELL=/bin/sh
INVOCATION_ID=15a077963d7b4ca0b82c91dc6519f87c
JOURNAL_STREAM=8:1616718
Making the environment special for the root user complicates things
unnecessarily. This change simplifies both our logic (by making the setting
of the variables unconditional), and should also simplify the logic in
services (particularly scripts).
Fixes#5124.
This substantially reworks the seccomp code, to ensure better
compatibility with some architectures, including i386.
So far we relied on libseccomp's internal handling of the multiple
syscall ABIs supported on Linux. This is problematic however, as it does
not define clear semantics if an ABI is not able to support specific
seccomp rules we install.
This rework hence changes a couple of things:
- We no longer use seccomp_rule_add(), but only
seccomp_rule_add_exact(), and fail the installation of a filter if the
architecture doesn't support it.
- We no longer rely on adding multiple syscall architectures to a single filter,
but instead install a separate filter for each syscall architecture
supported. This way, we can install a strict filter for x86-64, while
permitting a less strict filter for i386.
- All high-level filter additions are now moved from execute.c to
seccomp-util.c, so that we can test them independently of the service
execution logic.
- Tests have been added for all types of our seccomp filters.
- SystemCallFilters= and SystemCallArchitectures= are now implemented in
independent filters and installation logic, as they semantically are
very much independent of each other.
Fixes: #4575
This adds two new settings BindPaths= and BindReadOnlyPaths=. They allow
defining arbitrary bind mounts specific to particular services. This is
particularly useful for services with RootDirectory= set as this permits making
specific bits of the host directory available to chrooted services.
The two new settings follow the concepts nspawn already possess in --bind= and
--bind-ro=, as well as the .nspawn settings Bind= and BindReadOnly= (and these
latter options should probably be renamed to BindPaths= and BindReadOnlyPaths=
too).
Fixes: #3439
Let's store the invocation ID in the per-service keyring as a root-owned key,
with strict access rights. This has the advantage over the environment-based ID
passing that it also works from SUID binaries (as they key cannot be overidden
by unprivileged code starting them), in contrast to the secure_getenv() based
mode.
The invocation ID is now passed in three different ways to a service:
- As environment variable $INVOCATION_ID. This is easy to use, but may be
overriden by unprivileged code (which might be a bad or a good thing), which
means it's incompatible with SUID code (see above).
- As extended attribute on the service cgroup. This cannot be overriden by
unprivileged code, and may be queried safely from "outside" of a service.
However, it is incompatible with containers right now, as unprivileged
containers generally cannot set xattrs on cgroupfs.
- As "invocation_id" key in the kernel keyring. This has the benefit that the
key cannot be changed by unprivileged service code, and thus is safe to
access from SUID code (see above). But do note that service code can replace
the session keyring with a fresh one that lacks the key. However in that case
the key will not be owned by root, which is easily detectable. The keyring is
also incompatible with containers right now, as it is not properly namespace
aware (but this is being worked on), and thus most container managers mask
the keyring-related system calls.
Ideally we'd only have one way to pass the invocation ID, but the different
ways all have limitations. The invocation ID hookup in journald is currently
only available on the host but not in containers, due to the mentioned
limitations.
How to verify the new invocation ID in the keyring:
# systemd-run -t /bin/sh
Running as unit: run-rd917366c04f847b480d486017f7239d6.service
Press ^] three times within 1s to disconnect TTY.
# keyctl show
Session Keyring
680208392 --alswrv 0 0 keyring: _ses
250926536 ----s-rv 0 0 \_ user: invocation_id
# keyctl request user invocation_id
250926536
# keyctl read 250926536
16 bytes of data in key:
9c96317c ac64495a a42b9cd7 4f3ff96b
# echo $INVOCATION_ID
9c96317cac64495aa42b9cd74f3ff96b
# ^D
This creates a new transient service runnint a shell. Then verifies the
contents of the keyring, requests the invocation ID key, and reads its payload.
For comparison the invocation ID as passed via the environment variable is also
displayed.
This patch ensures that each system service gets its own session kernel keyring
automatically, and implicitly. Without this a keyring is allocated for it
on-demand, but is then linked with the user's kernel keyring, which is OK
behaviour for logged in users, but not so much for system services.
With this change each service gets a session keyring that is specific to the
service and ceases to exist when the service is shut down. The session keyring
is not linked up with the user keyring and keys hence only search within the
session boundaries by default.
(This is useful in a later commit to store per-service material in the keyring,
for example the invocation ID)
(With input from David Howells)
For some reasons units remaining in the same process group as PID 1
(same_pgrp=true) fail to acquire the console even if it's not taken by anyone.
So always accept for units with same_pgrp set for now.
Previously it was "[Yes, Fail, Skip]" which is pretty misleading because it
suggests that the whole word needs to be entered instead of a single char.
Also this won't fit well when we'll extend the number of choices.
This patch addresses this by changing the choice hint with "[y, f, s – h for help]"
so it's now clear that a single letter has to be entered.
It also introduces a new choice 'h' which describes all possible choices since
a single letter can be not descriptive enough for new users.
It also allow to stick with the same hint string regardless of how
many choices we will support.
When "confirmation_spawn=1", the confirmation question can look like:
Execute /usr/bin/kmod static-nodes --format=tmpfiles --output=/run/tmpfiles.d/kmod.conf? [Yes, No, Skip]
which is pretty verbose and might not fit in the console width size (which is
usually 80 chars) and thus question will be splitted into 2 consecutive lines.
However since the question is now refreshed every 2 secs, the reprinted
question will overwrite the second line of the previous one...
To prevent this, this patch makes sure that the command line won't be longer
than 60 chars by ellipsizing it if the command is longer:
Execute /usr/bin/kmod static-nodes --format=tmpfiles --output=/ru…nf? [Yes, No, View, Skip]
A following patch will introduce a new choice that will allow the user to get
details on the command to be executed so it will still be possible to see the
full command line.
Before this patch we had:
- "no" which gives "failing execution" but the command is actually assumed as
succeed.
- "skip" which gives "skipping", but the command is assumed to have failed,
which ends up with "Failed to start ..." on the console.
Now we have:
- "fail" which gives "failing execution" and the command is indeed assumed as
failed.
- "skip" which gives "skipping execution" and the command is assumed as
succeed.
Now the reponses are handled by ask_for_confirmation() as well as the report of
any errors occuring during the process of retrieving the confirmation response.
One benefit of this is that there's no need to open/close the console one more
time when reporting error/status messages.
The caller now just needs to care about the return values whose meanings are:
- don't execute and pretend that the command failed
- don't execute and pretend that the command succeeed
- positive answer, execute the command
Also some slight code reorganization and introduce write_confirm_error() and
write_confirm_error_fd(). write_confim_message becomes unneeded.
It's rather hard to parse the confirmation messages (enabled with
systemd.confirm_spawn=true) amongst the status messages and the kernel
ones (if enabled).
This patch gives the possibility to the user to redirect the confirmation
message to a different virtual console, either by giving its name or its path,
so those messages are separated from the other ones and easier to read.
The no_new_privileged_set variable is not used any more since commit
9b232d3241 that fixed another thing. So remove it. Also no
need to check if we are under user manager, remove that part too.
This adds a variable that is always set to false to make sure that
protect paths inside sandbox are always enforced and not ignored. The only
case when it is set to true is on DynamicUser=no and RootDirectory=/chroot
is set. This allows users to use more our sandbox features inside RootDirectory=
The only exception is ProtectSystem=full|strict and when DynamicUser=yes
is implied. Currently RootDirectory= is not fully compatible with these
due to two reasons:
* /chroot/usr|etc has to be present on ProtectSystem=full
* /chroot// has to be a mount point on ProtectSystem=strict.
core: add new RestrictNamespaces= unit file setting
Merging, not rebasing, because this touches many files and there were tree-wide cleanups in the mean time.
We don't have plural in the name of any other -util files and this
inconsistency trips me up every time I try to type this file name
from memory. "formats-util" is even hard to pronounce.
This new setting permits restricting whether namespaces may be created and
managed by processes started by a unit. It installs a seccomp filter blocking
certain invocations of unshare(), clone() and setns().
RestrictNamespaces=no is the default, and does not restrict namespaces in any
way. RestrictNamespaces=yes takes away the ability to create or manage any kind
of namspace. "RestrictNamespaces=mnt ipc" restricts the creation of namespaces
so that only mount and IPC namespaces may be created/managed, but no other
kind of namespaces.
This setting should be improve security quite a bit as in particular user
namespacing was a major source of CVEs in the kernel in the past, and is
accessible to unprivileged processes. With this setting the entire attack
surface may be removed for system services that do not make use of namespaces.
Make sure that when DynamicUser= is set that we intialize the user
supplementary groups and that we also support SupplementaryGroups=
Fixes: https://github.com/systemd/systemd/issues/4539
Thanks Evgeny Vereshchagin (@evverx)
Seccomp is generally an unprivileged operation, changing security contexts is
most likely associated with some form of policy. Moreover, while seccomp may
influence our own flow of code quite a bit (much more than the security context
change) make sure to apply the seccomp filters immediately before executing the
binary to invoke.
This also moves enforcement of NNP after the security context change, so that
NNP cannot affect it anymore. (However, the security policy now has to permit
the NNP change).
This change has a good chance of breaking current SELinux/AA/SMACK setups, because
the policy might not expect this change of behaviour. However, it's technically
the better choice I think and should hence be applied.
Fixes: #3993
This makes applying groups after applying the working directory, this
may allow some flexibility but at same it is not a big deal since we
don't execute or do anything between applying working directory and
droping groups.
This adds a new seccomp_init_conservative() helper call that is mostly just a
wrapper around seccomp_init(), but turns off NNP and adds in all secondary
archs, for best compatibility with everything else.
Pretty much all of our code used the very same constructs for these three
steps, hence unifying this in one small function makes things a lot shorter.
This also changes incorrect usage of the "scmp_filter_ctx" type at various
places. libseccomp defines it as typedef to "void*", i.e. it is a pointer type
(pretty poor choice already!) that casts implicitly to and from all other
pointer types (even poorer choice: you defined a confusing type now, and don't
even gain any bit of type safety through it...). A lot of the code assumed the
type would refer to a structure, and hence aded additional "*" here and there.
Remove that.
Let's simplify this call, by making use of the new infrastructure.
This is actually more in line with Djalal's original patch but instead of
search the filter set in the array by its name we can now use the set index and
jump directly to it.
A variety of fixes:
- rename the SystemCallFilterSet structure to SyscallFilterSet. So far the main
instance of it (the syscall_filter_sets[] array) used to abbreviate
"SystemCall" as "Syscall". Let's stick to one of the two syntaxes, and not
mix and match too wildly. Let's pick the shorter name in this case, as it is
sufficiently well established to not confuse hackers reading this.
- Export explicit indexes into the syscall_filter_sets[] array via an enum.
This way, code that wants to make use of a specific filter set, can index it
directly via the enum, instead of having to search for it. This makes
apply_private_devices() in particular a lot simpler.
- Provide two new helper calls in seccomp-util.c: syscall_filter_set_find() to
find a set by its name, seccomp_add_syscall_filter_set() to add a set to a
seccomp object.
- Update SystemCallFilter= parser to use extract_first_word(). Let's work on
deprecating FOREACH_WORD_QUOTED().
- Simplify apply_private_devices() using this functionality
Remove the assert and check the return code of sysconf(_SC_NGROUPS_MAX).
_SC_NGROUPS_MAX maps to NGROUPS_MAX which is defined in <limits.h> to
65536 these days. The value is a sysctl read-only
/proc/sys/kernel/ngroups_max and the kernel assumes that it is always
positive otherwise things may break. Follow this and support only
positive values for all other case return either -errno or -EOPNOTSUPP.
Now if there are systems that want to re-write NGROUPS_MAX then they
should not pass SupplementaryGroups= in units even if it is empty, in
this case nothing fails and we just ignore supplementary groups. However
if SupplementaryGroups= is passed even if it is empty we have to assume
that there will be groups manipulation from our side or the kernel and
since the kernel always assumes that NGROUPS_MAX is positive, then
follow that and support only positive values.
This commit adds a `fd` option to `StandardInput=`,
`StandardOutput=` and `StandardError=` properties in order to
connect standard streams to externally named descriptors provided
by some socket units.
This option looks for a file descriptor named as the corresponding
stream. Custom names can be specified, separated by a colon.
If multiple name-matches exist, the first matching fd will be used.
Lets go further and make /lib/modules/ inaccessible for services that do
not have business with modules, this is a minor improvment but it may
help on setups with custom modules and they are limited... in regard of
kernel auto-load feature.
This change introduce NameSpaceInfo struct which we may embed later
inside ExecContext but for now lets just reduce the argument number to
setup_namespace() and merge ProtectKernelModules feature.
This is useful to turn off explicit module load and unload operations on modular
kernels. This option removes CAP_SYS_MODULE from the capability bounding set for
the unit, and installs a system call filter to block module system calls.
This option will not prevent the kernel from loading modules using the module
auto-load feature which is a system wide operation.
If stdin is supplied as an fd for transient units (using the
StandardInputFileDescriptor pseudo-property for transient units), then we
should also fix up the TTY ownership, not just when we opened the TTY
ourselves.
This simply drops the explicit is_terminal_input()-based check. Note that
chown_terminal() internally does a much more appropriate isatty()-based check
anyway, hence we can drop this without replacement.
Fixes: #4260
This adds a new invocation ID concept to the service manager. The invocation ID
identifies each runtime cycle of a unit uniquely. A new randomized 128bit ID is
generated each time a unit moves from and inactive to an activating or active
state.
The primary usecase for this concept is to connect the runtime data PID 1
maintains about a service with the offline data the journal stores about it.
Previously we'd use the unit name plus start/stop times, which however is
highly racy since the journal will generally process log data after the service
already ended.
The "invocation ID" kinda matches the "boot ID" concept of the Linux kernel,
except that it applies to an individual unit instead of the whole system.
The invocation ID is passed to the activated processes as environment variable.
It is additionally stored as extended attribute on the cgroup of the unit. The
latter is used by journald to automatically retrieve it for each log logged
message and attach it to the log entry. The environment variable is very easily
accessible, even for unprivileged services. OTOH the extended attribute is only
accessible to privileged processes (this is because cgroupfs only supports the
"trusted." xattr namespace, not "user."). The environment variable may be
altered by services, the extended attribute may not be, hence is the better
choice for the journal.
Note that reading the invocation ID off the extended attribute from journald is
racy, similar to the way reading the unit name for a logging process is.
This patch adds APIs to read the invocation ID to sd-id128:
sd_id128_get_invocation() may be used in a similar fashion to
sd_id128_get_boot().
PID1's own logging is updated to always include the invocation ID when it logs
information about a unit.
A new bus call GetUnitByInvocationID() is added that allows retrieving a bus
path to a unit by its invocation ID. The bus path is built using the invocation
ID, thus providing a path for referring to a unit that is valid only for the
current runtime cycleof it.
Outlook for the future: should the kernel eventually allow passing of cgroup
information along AF_UNIX/SOCK_DGRAM messages via a unique cgroup id, then we
can alter the invocation ID to be generated as hash from that rather than
entirely randomly. This way we can derive the invocation race-freely from the
messages.
Let's drop the caching of the setgroups /proc field for now. While there's a
strict regime in place when it changes states, let's better not cache it since
we cannot really be sure we follow that regime correctly.
More importantly however, this is not in performance sensitive code, and
there's no indication the cache is really beneficial, hence let's drop the
caching and make things a bit simpler.
Also, while we are at it, rework the error handling a bit, and always return
negative errno-style error codes, following our usual coding style. This has
the benefit that we can sensible hanld read_one_line_file() errors, without
having to updat errno explicitly.
In the process execution code of PID 1, before
096424d123 the GID settings where changed before
invoking PAM, and the UID settings after. After the change both changes are
made after the PAM session hooks are run. When invoking PAM we fork once, and
leave a stub process around which will invoke the PAM session end hooks when
the session goes away. This code previously was dropping the remaining privs
(which were precisely the UID). Fix this code to do this correctly again, by
really dropping them else (i.e. the GID as well).
While we are at it, also fix error logging of this code.
Fixes: #4238
If device access is restricted via PrivateDevices=, let's also block the
various low-level I/O syscalls at the same time, so that we know that the
minimal set of devices in our virtualized /dev are really everything the unit
can access.
If PrivateDevices=yes is set, the namespace code creates device nodes in /dev
that should be owned by the host's root, hence let's make sure we set up the
namespace before dropping group privileges.
This adds a new call get_user_creds_clean(), which is just like
get_user_creds() but returns NULL in the home/shell parameters if they contain
no useful information. This code previously lived in execute.c, but by
generalizing this we can reuse it in run.c.
In https://github.com/systemd/systemd/pull/4004 , a runtime detection
method for seccomp was added. However, it does not detect the case
where CONFIG_SECCOMP=y but CONFIG_SECCOMP_FILTER=n. This is possible
if the architecture does not support filtering yet.
Add a check for that case too.
While at it, change get_proc_field usage to use PR_GET_SECCOMP prctl,
as that should save a few system calls and (unnecessary) allocations.
Previously, reading of /proc/self/stat was done as recommended by
prctl(2) as safer. However, given that we need to do the prctl call
anyway, lets skip opening, reading and parsing the file.
Code for checking inspired by
https://outflux.net/teach-seccomp/autodetect.html