Commit 74dd6b515f (core: run each system
service with a fresh session keyring) broke adding keys to user keyring.
Added keys could not be accessed with error message:
keyctl_read_alloc: Permission denied
So link the user keyring to our session keyring.
This patch is a bit more complex thant I hoped. In particular the single
IOScheduling= property exposed on the bus is split up into
IOSchedulingClass= and IOSchedulingPriority= (though compat is
retained). Otherwise the asymmetry between setting props and getting
them is a bit too nasty.
Fixes#5613
'n_fds' field in the ExecParameters structure was counting the total number of
file descriptors to be passed to a unit.
This counter also includes the number of passed socket fds which is counted by
'n_socket_fds' already.
This patch removes that redundancy by replacing 'n_fds' with
'n_storage_fds'. The new field only counts the fds passed via the storage store
mechanism. That way each fd is counted at one place only.
Subsequently the patch makes sure to fix code that used 'n_fds' and also wanted
to iterate through all of them by explicitly adding 'n_socket_fds' + 'n_storage_fds'.
Suggested by Lennart.
Make sure to only apply the O_NONBLOCK flag to the fds passed via socket
activation.
Previously the flag was also applied to the fds which came from the fd store
but this was incorrect since services, after being restarted, expect that these
passed fds have their flags unchanged and can be reused as before.
The documentation was a bit unclear about this so clarify it.
Till now if the params->n_fds was 0, systemd was logging that there were
more than one sockets.
Thanks @gregoryp and @VFXcode who did the most work debugging this.
This doesn't really matter much, only in case somebody would use
something strange like
EnvironmentFile=/etc/something/.*
Make sure that "." and ".." is not returned by that glob. This makes
all our globbing patterns behave the same.
log_struct takes multiple format strings, each one followed by arguments.
The _printf_ annotation is not sufficiently flexible to express this,
but we can still annotate the first format string, though not its
arguments (because their number is unknown).
With the annotation, the places which specified the message id or similar
as the first pattern cause a warning from -Wformat-nonliteral. This can
be trivially fixed by putting the MESSAGE= first.
This change will help find issues where a non-literal is erroneously used
as the pattern.
The MountAPIVFS= documentation says that this options has no effect
unless used in conjunction with RootDirectory= or RootImage= ,lets fix
this and avoid to create private mount namespaces where it is not
needed.
Embedding sd_id128_t's in constant strings was rather cumbersome. We had
SD_ID128_CONST_STR which returned a const char[], but it had two problems:
- it wasn't possible to statically concatanate this array with a normal string
- gcc wasn't really able to optimize this, and generated code to perform the
"conversion" at runtime.
Because of this, even our own code in coredumpctl wasn't using
SD_ID128_CONST_STR.
Add a new macro to generate a constant string: SD_ID128_MAKE_STR.
It is not as elegant as SD_ID128_CONST_STR, because it requires a repetition
of the numbers, but in practice it is more convenient to use, and allows gcc
to generate smarter code:
$ size .libs/systemd{,-logind,-journald}{.old,}
text data bss dec hex filename
1265204 149564 4808 1419576 15a938 .libs/systemd.old
1260268 149564 4808 1414640 1595f0 .libs/systemd
246805 13852 209 260866 3fb02 .libs/systemd-logind.old
240973 13852 209 255034 3e43a .libs/systemd-logind
146839 4984 34 151857 25131 .libs/systemd-journald.old
146391 4984 34 151409 24f71 .libs/systemd-journald
It is also much easier to check if a certain binary uses a certain MESSAGE_ID:
$ strings .libs/systemd.old|grep MESSAGE_ID
MESSAGE_ID=%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x
MESSAGE_ID=%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x
MESSAGE_ID=%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x
MESSAGE_ID=%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x
$ strings .libs/systemd|grep MESSAGE_ID
MESSAGE_ID=c7a787079b354eaaa9e77b371893cd27
MESSAGE_ID=b07a249cd024414a82dd00cd181378ff
MESSAGE_ID=641257651c1b4ec9a8624d7a40a9e1e7
MESSAGE_ID=de5b426a63be47a7b6ac3eaac82e2f6f
MESSAGE_ID=d34d037fff1847e6ae669a370e694725
MESSAGE_ID=7d4958e842da4a758f6c1cdc7b36dcc5
MESSAGE_ID=1dee0369c7fc4736b7099b38ecb46ee7
MESSAGE_ID=39f53479d3a045ac8e11786248231fbf
MESSAGE_ID=be02cf6855d2428ba40df7e9d022f03d
MESSAGE_ID=7b05ebc668384222baa8881179cfda54
MESSAGE_ID=9d1aaa27d60140bd96365438aad20286
ReadOnlyPaths=, ProtectHome=, InaccessiblePaths= and ProtectSystem= are
about restricting access and little more, hence they should be disabled
if PermissionsStartOnly= is used or ExecStart= lines are prefixed with a
"+". Do that.
(Note that we will still create namespaces and stuff, since that's about
a lot more than just permissions. We'll simply disable the effect of
the four options mentioned above, but nothing else mount related.)
This also adds a test for this, to ensure this works as intended.
No documentation updates, as the documentation are already vague enough
to support the new behaviour ("If true, the permission-related execution
options…"). We could clarify this further, but I think we might want to
extend the switches' behaviour a bit more in future, hence leave it at
this for now.
Fixes: #5308
Or actually, try to to do the right thing depending on what is
available:
- If we know $HOME from User=, then use that.
- If the UID for the service is 0, hardcode that WorkingDirectory=~ means WorkingDirectory=/root
- In any other case (which will be the unprivileged --user case), use
get_home_dir() to find the $HOME of the user we are running as.
- Otherwise fail.
Fixes: #5246#5124
This is similar to RootDirectory= but mounts the root file system from a
block device or loopback file instead of another directory.
This reuses the image dissector code now used by nspawn and
gpt-auto-discovery.
This adds a boolean unit file setting MountAPIVFS=. If set, the three
main API VFS mounts will be mounted for the service. This only has an
effect on RootDirectory=, which it makes a ton times more useful.
(This is basically the /dev + /proc + /sys mounting code posted in the
original #4727, but rebased on current git, and with the automatic logic
replaced by explicit logic controlled by a unit file setting)
Before previous commit, username would be NULL for root, and set only
for other users. So the argument passed to utmp_put_init_process()
would be "root" for other users and NULL for root. Seems strange.
Instead, always pass the username if available.
This changes the environment for services running as root from:
LANG=C.utf8
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
INVOCATION_ID=ffbdec203c69499a9b83199333e31555
JOURNAL_STREAM=8:1614518
to
LANG=C.utf8
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
HOME=/root
LOGNAME=root
USER=root
SHELL=/bin/sh
INVOCATION_ID=15a077963d7b4ca0b82c91dc6519f87c
JOURNAL_STREAM=8:1616718
Making the environment special for the root user complicates things
unnecessarily. This change simplifies both our logic (by making the setting
of the variables unconditional), and should also simplify the logic in
services (particularly scripts).
Fixes#5124.
This substantially reworks the seccomp code, to ensure better
compatibility with some architectures, including i386.
So far we relied on libseccomp's internal handling of the multiple
syscall ABIs supported on Linux. This is problematic however, as it does
not define clear semantics if an ABI is not able to support specific
seccomp rules we install.
This rework hence changes a couple of things:
- We no longer use seccomp_rule_add(), but only
seccomp_rule_add_exact(), and fail the installation of a filter if the
architecture doesn't support it.
- We no longer rely on adding multiple syscall architectures to a single filter,
but instead install a separate filter for each syscall architecture
supported. This way, we can install a strict filter for x86-64, while
permitting a less strict filter for i386.
- All high-level filter additions are now moved from execute.c to
seccomp-util.c, so that we can test them independently of the service
execution logic.
- Tests have been added for all types of our seccomp filters.
- SystemCallFilters= and SystemCallArchitectures= are now implemented in
independent filters and installation logic, as they semantically are
very much independent of each other.
Fixes: #4575
This adds two new settings BindPaths= and BindReadOnlyPaths=. They allow
defining arbitrary bind mounts specific to particular services. This is
particularly useful for services with RootDirectory= set as this permits making
specific bits of the host directory available to chrooted services.
The two new settings follow the concepts nspawn already possess in --bind= and
--bind-ro=, as well as the .nspawn settings Bind= and BindReadOnly= (and these
latter options should probably be renamed to BindPaths= and BindReadOnlyPaths=
too).
Fixes: #3439
Let's store the invocation ID in the per-service keyring as a root-owned key,
with strict access rights. This has the advantage over the environment-based ID
passing that it also works from SUID binaries (as they key cannot be overidden
by unprivileged code starting them), in contrast to the secure_getenv() based
mode.
The invocation ID is now passed in three different ways to a service:
- As environment variable $INVOCATION_ID. This is easy to use, but may be
overriden by unprivileged code (which might be a bad or a good thing), which
means it's incompatible with SUID code (see above).
- As extended attribute on the service cgroup. This cannot be overriden by
unprivileged code, and may be queried safely from "outside" of a service.
However, it is incompatible with containers right now, as unprivileged
containers generally cannot set xattrs on cgroupfs.
- As "invocation_id" key in the kernel keyring. This has the benefit that the
key cannot be changed by unprivileged service code, and thus is safe to
access from SUID code (see above). But do note that service code can replace
the session keyring with a fresh one that lacks the key. However in that case
the key will not be owned by root, which is easily detectable. The keyring is
also incompatible with containers right now, as it is not properly namespace
aware (but this is being worked on), and thus most container managers mask
the keyring-related system calls.
Ideally we'd only have one way to pass the invocation ID, but the different
ways all have limitations. The invocation ID hookup in journald is currently
only available on the host but not in containers, due to the mentioned
limitations.
How to verify the new invocation ID in the keyring:
# systemd-run -t /bin/sh
Running as unit: run-rd917366c04f847b480d486017f7239d6.service
Press ^] three times within 1s to disconnect TTY.
# keyctl show
Session Keyring
680208392 --alswrv 0 0 keyring: _ses
250926536 ----s-rv 0 0 \_ user: invocation_id
# keyctl request user invocation_id
250926536
# keyctl read 250926536
16 bytes of data in key:
9c96317c ac64495a a42b9cd7 4f3ff96b
# echo $INVOCATION_ID
9c96317cac64495aa42b9cd74f3ff96b
# ^D
This creates a new transient service runnint a shell. Then verifies the
contents of the keyring, requests the invocation ID key, and reads its payload.
For comparison the invocation ID as passed via the environment variable is also
displayed.
This patch ensures that each system service gets its own session kernel keyring
automatically, and implicitly. Without this a keyring is allocated for it
on-demand, but is then linked with the user's kernel keyring, which is OK
behaviour for logged in users, but not so much for system services.
With this change each service gets a session keyring that is specific to the
service and ceases to exist when the service is shut down. The session keyring
is not linked up with the user keyring and keys hence only search within the
session boundaries by default.
(This is useful in a later commit to store per-service material in the keyring,
for example the invocation ID)
(With input from David Howells)
For some reasons units remaining in the same process group as PID 1
(same_pgrp=true) fail to acquire the console even if it's not taken by anyone.
So always accept for units with same_pgrp set for now.
Previously it was "[Yes, Fail, Skip]" which is pretty misleading because it
suggests that the whole word needs to be entered instead of a single char.
Also this won't fit well when we'll extend the number of choices.
This patch addresses this by changing the choice hint with "[y, f, s – h for help]"
so it's now clear that a single letter has to be entered.
It also introduces a new choice 'h' which describes all possible choices since
a single letter can be not descriptive enough for new users.
It also allow to stick with the same hint string regardless of how
many choices we will support.
When "confirmation_spawn=1", the confirmation question can look like:
Execute /usr/bin/kmod static-nodes --format=tmpfiles --output=/run/tmpfiles.d/kmod.conf? [Yes, No, Skip]
which is pretty verbose and might not fit in the console width size (which is
usually 80 chars) and thus question will be splitted into 2 consecutive lines.
However since the question is now refreshed every 2 secs, the reprinted
question will overwrite the second line of the previous one...
To prevent this, this patch makes sure that the command line won't be longer
than 60 chars by ellipsizing it if the command is longer:
Execute /usr/bin/kmod static-nodes --format=tmpfiles --output=/ru…nf? [Yes, No, View, Skip]
A following patch will introduce a new choice that will allow the user to get
details on the command to be executed so it will still be possible to see the
full command line.
Before this patch we had:
- "no" which gives "failing execution" but the command is actually assumed as
succeed.
- "skip" which gives "skipping", but the command is assumed to have failed,
which ends up with "Failed to start ..." on the console.
Now we have:
- "fail" which gives "failing execution" and the command is indeed assumed as
failed.
- "skip" which gives "skipping execution" and the command is assumed as
succeed.
Now the reponses are handled by ask_for_confirmation() as well as the report of
any errors occuring during the process of retrieving the confirmation response.
One benefit of this is that there's no need to open/close the console one more
time when reporting error/status messages.
The caller now just needs to care about the return values whose meanings are:
- don't execute and pretend that the command failed
- don't execute and pretend that the command succeeed
- positive answer, execute the command
Also some slight code reorganization and introduce write_confirm_error() and
write_confirm_error_fd(). write_confim_message becomes unneeded.
It's rather hard to parse the confirmation messages (enabled with
systemd.confirm_spawn=true) amongst the status messages and the kernel
ones (if enabled).
This patch gives the possibility to the user to redirect the confirmation
message to a different virtual console, either by giving its name or its path,
so those messages are separated from the other ones and easier to read.
The no_new_privileged_set variable is not used any more since commit
9b232d3241 that fixed another thing. So remove it. Also no
need to check if we are under user manager, remove that part too.
This adds a variable that is always set to false to make sure that
protect paths inside sandbox are always enforced and not ignored. The only
case when it is set to true is on DynamicUser=no and RootDirectory=/chroot
is set. This allows users to use more our sandbox features inside RootDirectory=
The only exception is ProtectSystem=full|strict and when DynamicUser=yes
is implied. Currently RootDirectory= is not fully compatible with these
due to two reasons:
* /chroot/usr|etc has to be present on ProtectSystem=full
* /chroot// has to be a mount point on ProtectSystem=strict.
core: add new RestrictNamespaces= unit file setting
Merging, not rebasing, because this touches many files and there were tree-wide cleanups in the mean time.
We don't have plural in the name of any other -util files and this
inconsistency trips me up every time I try to type this file name
from memory. "formats-util" is even hard to pronounce.
This new setting permits restricting whether namespaces may be created and
managed by processes started by a unit. It installs a seccomp filter blocking
certain invocations of unshare(), clone() and setns().
RestrictNamespaces=no is the default, and does not restrict namespaces in any
way. RestrictNamespaces=yes takes away the ability to create or manage any kind
of namspace. "RestrictNamespaces=mnt ipc" restricts the creation of namespaces
so that only mount and IPC namespaces may be created/managed, but no other
kind of namespaces.
This setting should be improve security quite a bit as in particular user
namespacing was a major source of CVEs in the kernel in the past, and is
accessible to unprivileged processes. With this setting the entire attack
surface may be removed for system services that do not make use of namespaces.
Make sure that when DynamicUser= is set that we intialize the user
supplementary groups and that we also support SupplementaryGroups=
Fixes: https://github.com/systemd/systemd/issues/4539
Thanks Evgeny Vereshchagin (@evverx)
Seccomp is generally an unprivileged operation, changing security contexts is
most likely associated with some form of policy. Moreover, while seccomp may
influence our own flow of code quite a bit (much more than the security context
change) make sure to apply the seccomp filters immediately before executing the
binary to invoke.
This also moves enforcement of NNP after the security context change, so that
NNP cannot affect it anymore. (However, the security policy now has to permit
the NNP change).
This change has a good chance of breaking current SELinux/AA/SMACK setups, because
the policy might not expect this change of behaviour. However, it's technically
the better choice I think and should hence be applied.
Fixes: #3993
This makes applying groups after applying the working directory, this
may allow some flexibility but at same it is not a big deal since we
don't execute or do anything between applying working directory and
droping groups.
This adds a new seccomp_init_conservative() helper call that is mostly just a
wrapper around seccomp_init(), but turns off NNP and adds in all secondary
archs, for best compatibility with everything else.
Pretty much all of our code used the very same constructs for these three
steps, hence unifying this in one small function makes things a lot shorter.
This also changes incorrect usage of the "scmp_filter_ctx" type at various
places. libseccomp defines it as typedef to "void*", i.e. it is a pointer type
(pretty poor choice already!) that casts implicitly to and from all other
pointer types (even poorer choice: you defined a confusing type now, and don't
even gain any bit of type safety through it...). A lot of the code assumed the
type would refer to a structure, and hence aded additional "*" here and there.
Remove that.
Let's simplify this call, by making use of the new infrastructure.
This is actually more in line with Djalal's original patch but instead of
search the filter set in the array by its name we can now use the set index and
jump directly to it.
A variety of fixes:
- rename the SystemCallFilterSet structure to SyscallFilterSet. So far the main
instance of it (the syscall_filter_sets[] array) used to abbreviate
"SystemCall" as "Syscall". Let's stick to one of the two syntaxes, and not
mix and match too wildly. Let's pick the shorter name in this case, as it is
sufficiently well established to not confuse hackers reading this.
- Export explicit indexes into the syscall_filter_sets[] array via an enum.
This way, code that wants to make use of a specific filter set, can index it
directly via the enum, instead of having to search for it. This makes
apply_private_devices() in particular a lot simpler.
- Provide two new helper calls in seccomp-util.c: syscall_filter_set_find() to
find a set by its name, seccomp_add_syscall_filter_set() to add a set to a
seccomp object.
- Update SystemCallFilter= parser to use extract_first_word(). Let's work on
deprecating FOREACH_WORD_QUOTED().
- Simplify apply_private_devices() using this functionality
Remove the assert and check the return code of sysconf(_SC_NGROUPS_MAX).
_SC_NGROUPS_MAX maps to NGROUPS_MAX which is defined in <limits.h> to
65536 these days. The value is a sysctl read-only
/proc/sys/kernel/ngroups_max and the kernel assumes that it is always
positive otherwise things may break. Follow this and support only
positive values for all other case return either -errno or -EOPNOTSUPP.
Now if there are systems that want to re-write NGROUPS_MAX then they
should not pass SupplementaryGroups= in units even if it is empty, in
this case nothing fails and we just ignore supplementary groups. However
if SupplementaryGroups= is passed even if it is empty we have to assume
that there will be groups manipulation from our side or the kernel and
since the kernel always assumes that NGROUPS_MAX is positive, then
follow that and support only positive values.
This commit adds a `fd` option to `StandardInput=`,
`StandardOutput=` and `StandardError=` properties in order to
connect standard streams to externally named descriptors provided
by some socket units.
This option looks for a file descriptor named as the corresponding
stream. Custom names can be specified, separated by a colon.
If multiple name-matches exist, the first matching fd will be used.
Lets go further and make /lib/modules/ inaccessible for services that do
not have business with modules, this is a minor improvment but it may
help on setups with custom modules and they are limited... in regard of
kernel auto-load feature.
This change introduce NameSpaceInfo struct which we may embed later
inside ExecContext but for now lets just reduce the argument number to
setup_namespace() and merge ProtectKernelModules feature.
This is useful to turn off explicit module load and unload operations on modular
kernels. This option removes CAP_SYS_MODULE from the capability bounding set for
the unit, and installs a system call filter to block module system calls.
This option will not prevent the kernel from loading modules using the module
auto-load feature which is a system wide operation.
If stdin is supplied as an fd for transient units (using the
StandardInputFileDescriptor pseudo-property for transient units), then we
should also fix up the TTY ownership, not just when we opened the TTY
ourselves.
This simply drops the explicit is_terminal_input()-based check. Note that
chown_terminal() internally does a much more appropriate isatty()-based check
anyway, hence we can drop this without replacement.
Fixes: #4260
This adds a new invocation ID concept to the service manager. The invocation ID
identifies each runtime cycle of a unit uniquely. A new randomized 128bit ID is
generated each time a unit moves from and inactive to an activating or active
state.
The primary usecase for this concept is to connect the runtime data PID 1
maintains about a service with the offline data the journal stores about it.
Previously we'd use the unit name plus start/stop times, which however is
highly racy since the journal will generally process log data after the service
already ended.
The "invocation ID" kinda matches the "boot ID" concept of the Linux kernel,
except that it applies to an individual unit instead of the whole system.
The invocation ID is passed to the activated processes as environment variable.
It is additionally stored as extended attribute on the cgroup of the unit. The
latter is used by journald to automatically retrieve it for each log logged
message and attach it to the log entry. The environment variable is very easily
accessible, even for unprivileged services. OTOH the extended attribute is only
accessible to privileged processes (this is because cgroupfs only supports the
"trusted." xattr namespace, not "user."). The environment variable may be
altered by services, the extended attribute may not be, hence is the better
choice for the journal.
Note that reading the invocation ID off the extended attribute from journald is
racy, similar to the way reading the unit name for a logging process is.
This patch adds APIs to read the invocation ID to sd-id128:
sd_id128_get_invocation() may be used in a similar fashion to
sd_id128_get_boot().
PID1's own logging is updated to always include the invocation ID when it logs
information about a unit.
A new bus call GetUnitByInvocationID() is added that allows retrieving a bus
path to a unit by its invocation ID. The bus path is built using the invocation
ID, thus providing a path for referring to a unit that is valid only for the
current runtime cycleof it.
Outlook for the future: should the kernel eventually allow passing of cgroup
information along AF_UNIX/SOCK_DGRAM messages via a unique cgroup id, then we
can alter the invocation ID to be generated as hash from that rather than
entirely randomly. This way we can derive the invocation race-freely from the
messages.
Let's drop the caching of the setgroups /proc field for now. While there's a
strict regime in place when it changes states, let's better not cache it since
we cannot really be sure we follow that regime correctly.
More importantly however, this is not in performance sensitive code, and
there's no indication the cache is really beneficial, hence let's drop the
caching and make things a bit simpler.
Also, while we are at it, rework the error handling a bit, and always return
negative errno-style error codes, following our usual coding style. This has
the benefit that we can sensible hanld read_one_line_file() errors, without
having to updat errno explicitly.
In the process execution code of PID 1, before
096424d123 the GID settings where changed before
invoking PAM, and the UID settings after. After the change both changes are
made after the PAM session hooks are run. When invoking PAM we fork once, and
leave a stub process around which will invoke the PAM session end hooks when
the session goes away. This code previously was dropping the remaining privs
(which were precisely the UID). Fix this code to do this correctly again, by
really dropping them else (i.e. the GID as well).
While we are at it, also fix error logging of this code.
Fixes: #4238
If device access is restricted via PrivateDevices=, let's also block the
various low-level I/O syscalls at the same time, so that we know that the
minimal set of devices in our virtualized /dev are really everything the unit
can access.
If PrivateDevices=yes is set, the namespace code creates device nodes in /dev
that should be owned by the host's root, hence let's make sure we set up the
namespace before dropping group privileges.
This adds a new call get_user_creds_clean(), which is just like
get_user_creds() but returns NULL in the home/shell parameters if they contain
no useful information. This code previously lived in execute.c, but by
generalizing this we can reuse it in run.c.
In https://github.com/systemd/systemd/pull/4004 , a runtime detection
method for seccomp was added. However, it does not detect the case
where CONFIG_SECCOMP=y but CONFIG_SECCOMP_FILTER=n. This is possible
if the architecture does not support filtering yet.
Add a check for that case too.
While at it, change get_proc_field usage to use PR_GET_SECCOMP prctl,
as that should save a few system calls and (unnecessary) allocations.
Previously, reading of /proc/self/stat was done as recommended by
prctl(2) as safer. However, given that we need to do the prctl call
anyway, lets skip opening, reading and parsing the file.
Code for checking inspired by
https://outflux.net/teach-seccomp/autodetect.html
dbus-daemon does NSS name look-ups in order to enforce its bus policy. This
might dead-lock if an NSS module use wants to use D-Bus for the look-up itself,
like our nss-systemd does. Let's work around this by bypassing bus
communication in the NSS module if we run inside of dbus-daemon. To make this
work we keep a bit of extra state in /run/systemd/dynamic-uid/ so that we don't
have to consult the bus, but can still resolve the names.
Note that the normal codepath continues to be via the bus, so that resolving
works from all mount namespaces and is subject to authentication, as before.
This is a bit dirty, but not too dirty, as dbus daemon is kinda special anyway
for PID 1.
This adds the boolean RemoveIPC= setting to service, socket, mount and swap
units (i.e. all unit types that may invoke processes). if turned on, and the
unit's user/group is not root, all IPC objects of the user/group are removed
when the service is shut down. The life-cycle of the IPC objects is hence bound
to the unit life-cycle.
This is particularly relevant for units with dynamic users, as it is essential
that no objects owned by the dynamic users survive the service exiting. In
fact, this patch adds code to imply RemoveIPC= if DynamicUser= is set.
In order to communicate the UID/GID of an executed process back to PID 1 this
adds a new "user lookup" socket pair, that is inherited into the forked
processes, and closed before the exec(). This is needed since we cannot do NSS
from PID 1 due to deadlock risks, However need to know the used UID/GID in
order to clean up IPC owned by it if the unit shuts down.
The ExecParameters structure contains a number of bit-flags, that were so far
exposed as bool:1, change this to a proper, single binary bit flag field. This
makes things a bit more expressive, and is helpful as we add more flags, since
these booleans are passed around in various callers, for example
service_spawn(), whose signature can be made much shorter now.
Not all bit booleans from ExecParameters are moved into the flags field for
now, but this can be added later.
This setting adds minimal user namespacing support to a service. When set the invoked
processes will run in their own user namespace. Only a trivial mapping will be
set up: the root user/group is mapped to root, and the user/group of the
service will be mapped to itself, everything else is mapped to nobody.
If this setting is used the service runs with no capabilities on the host, but
configurable capabilities within the service.
This setting is particularly useful in conjunction with RootDirectory= as the
need to synchronize /etc/passwd and /etc/group between the host and the service
OS tree is reduced, as only three UID/GIDs need to match: root, nobody and the
user of the service itself. But even outside the RootDirectory= case this
setting is useful to substantially reduce the attack surface of a service.
Example command to test this:
systemd-run -p PrivateUsers=1 -p User=foobar -t /bin/sh
This runs a shell as user "foobar". When typing "ps" only processes owned by
"root", by "foobar", and by "nobody" should be visible.
This way, invoking nspawn from a shell in the best case inherits the TERM
setting all the way down into the login shell spawned in the container.
Fixes: #3697
This adds a new boolean setting DynamicUser= to service files. If set, a new
user will be allocated dynamically when the unit is started, and released when
it is stopped. The user ID is allocated from the range 61184..65519. The user
will not be added to /etc/passwd (but an NSS module to be added later should
make it show up in getent passwd).
For now, care should be taken that the service writes no files to disk, since
this might result in files owned by UIDs that might get assigned dynamically to
a different service later on. Later patches will tighten sandboxing in order to
ensure that this cannot happen, except for a few selected directories.
A simple way to test this is:
systemd-run -p DynamicUser=1 /bin/sleep 99999
All other functions in execute.c that need the unit id take a Unit* parameter
as first argument. Let's change connect_logger_as() to follow a similar logic.
This patch renames Read{Write,Only}Directories= and InaccessibleDirectories=
to Read{Write,Only}Paths= and InaccessiblePaths=, previous names are kept
as aliases but they are not advertised in the documentation.
Renamed variables:
`read_write_dirs` --> `read_write_paths`
`read_only_dirs` --> `read_only_paths`
`inaccessible_dirs` --> `inaccessible_paths`
By cleaning up before setting up PAM we maintain control of overriding
behavior in setting variables. Otherwise, pam_putenv is in control.
This also makes sure we use a cleaned up environment in replacing
variables in argv.
This permits services to detect whether their stdout/stderr is connected to the
journal, and if so talk to the journal directly, thus permitting carrying of
metadata.
As requested by the gtk folks: #2473
Move the merger of environment variables before setting up the PAM
session and pass the aggregate environment to PAM setup. This allows
control over the PAM session hooks through environment variables.
PAM session initiation may update the environment. On successful
initiation of a PAM session, we adopt the environment of the
PAM context.
This patch implements the new magic character '!'. By putting '!' in front
of a command, systemd executes it with full privileges ignoring paramters
such as User, Group, SupplementaryGroups, CapabilityBoundingSet,
AmbientCapabilities, SecureBits, SystemCallFilter, SELinuxContext,
AppArmorProfile, SmackProcessLabel, and RestrictAddressFamilies.
Fixes partially https://github.com/systemd/systemd/issues/3414
Related to https://github.com/coreos/rkt/issues/2482
Testing:
1. Create a user 'bob'
2. Create the unit file /etc/systemd/system/exec-perm.service
(You can use the example below)
3. sudo systemctl start ext-perm.service
4. Verify that the commands starting with '!' were not executed as bob,
4.1 Looking to the output of ls -l /tmp/exec-perm
4.2 Each file contains the result of the id command.
`````````````````````````````````````````````````````````````````
[Unit]
Description=ext-perm
[Service]
Type=oneshot
TimeoutStartSec=0
User=bob
ExecStartPre=!/usr/bin/sh -c "/usr/bin/rm /tmp/exec-perm*" ;
/usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-start-pre"
ExecStart=/usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-start" ;
!/usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-star-2"
ExecStartPost=/usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-start-post"
ExecReload=/usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-reload"
ExecStop=!/usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-stop"
ExecStopPost=/usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-stop-post"
[Install]
WantedBy=multi-user.target]
`````````````````````````````````````````````````````````````````
Let's add an extra safety check before we chmod/chown a TTY to the right user,
as we might end up having connected something to STDIN/STDOUT that is actually
not a TTY, even though this might have been requested, due to permissive
StandardInput= settings or transient service activation with fds passed in.
Fixes:
https://bugs.freedesktop.org/show_bug.cgi?id=85255
New exec boolean MemoryDenyWriteExecute, when set, installs
a seccomp filter to reject mmap(2) with PAGE_WRITE|PAGE_EXEC
and mprotect(2) with PAGE_EXEC.
The macro determines the right length of a AF_UNIX "struct sockaddr_un" to pass to
connect() or bind(). It automatically figures out if the socket refers to an
abstract namespace socket, or a socket in the file system, and properly handles
the full length of the path field.
This macro is not only safer, but also simpler to use, than the usual
offsetof() + strlen() logic.
The manpage of seccomp specify that using seccomp with
SECCOMP_SET_MODE_FILTER will return EACCES if the caller do not have
CAP_SYS_ADMIN set, or if the no_new_privileges bit is not set. Hence,
without NoNewPrivilege set, it is impossible to use a SystemCall*
directive with a User directive set in system mode.
Now, NoNewPrivileges is set if we are in user mode, or if we are in
system mode and we don't have CAP_SYS_ADMIN, and SystemCall*
directives are used.
Throughout the tree there's spurious use of spaces separating ++ and --
operators from their respective operands. Make ++ and -- operator
consistent with the majority of existing uses; discard the spaces.
The setting is hardly useful (since its effect is generally reduced to zero due
to file system caps), and with the advent of ambient caps an actually useful
replacement exists, hence let's get rid of this.
I am pretty sure this was unused and our man page already recommended against
its use, hence this should be a safe thing to remove.
Assign errno-style errors to a variable called "r" when they happen, the same way we do this in most other calls. It's
bad enough that the error handling part of the function deals with two different error variables (pam_code and r) now,
but before this fix it was even three!
gcc is confused by the common idiom of
return errno ? -errno : -ESOMETHING
and thinks a positive value may be returned. Replace this condition
with errno > 0 to help gcc and avoid many spurious warnings. I filed
a gcc rfe a long time ago, but it hard to say if it will ever be
implemented [1].
Both conventions were used in the codebase, this change makes things
more consistent. This is a follow up to bcb161b023.
[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61846
This patch adds support for ambient capabilities in service files. The
idea with ambient capabilities is that the execed processes can run with
non-root user and get some inherited capabilities, without having any
need to add the capabilities to the executable file.
You need at least Linux 4.3 to use ambient capabilities. SecureBit
keep-caps is automatically added when you use ambient capabilities and
wish to change the user.
An example system service file might look like this:
[Unit]
Description=Service for testing caps
[Service]
ExecStart=/usr/bin/sleep 10000
User=nobody
AmbientCapabilities=CAP_NET_ADMIN CAP_NET_RAW
After starting the service it has these capabilities:
CapInh: 0000000000003000
CapPrm: 0000000000003000
CapEff: 0000000000003000
CapBnd: 0000003fffffffff
CapAmb: 0000000000003000
Change the capability bounding set parser and logic so that the bounding
set is kept as a positive set internally. This means that the set
reflects those capabilities that we want to keep instead of drop.
If pid < 0 after fork(), 0 is always returned because r =
exec_context_load_environment() has exited successfully.
This will make the caller of exec_spawn() not able to handle
the fork() error case and make systemd abort assert() possibly.
This directive allows passing environment variables from the system
manager to spawned services. Variables in the system manager can be set
inside a container by passing `--set-env=...` options to systemd-spawn.
Tested with an on-disk test.service unit. Tested using multiple variable
names on a single line, with an empty setting to clear the current list
of variables, with non-existing variables.
Tested using `systemd-run -p PassEnvironment=VARNAME` to confirm it
works with transient units.
Confirmed that `systemctl show` will display the PassEnvironment
settings.
Checked that man pages are generated correctly.
No regressions in `make check`.
The files are named too generically, so that they might conflict with
the upstream project headers. Hence, let's add a "-util" suffix, to
clarify that this are just our utility headers and not any official
upstream headers.
There are more than enough calls doing string manipulations to deserve
its own files, hence do something about it.
This patch also sorts the #include blocks of all files that needed to be
updated, according to the sorting suggestions from CODING_STYLE. Since
pretty much every file needs our string manipulation functions this
effectively means that most files have sorted #include blocks now.
Also touches a few unrelated include files.
Before, we'd always reset acquired terminals, which is not really
desired, as we expose a setting TTYReset= which is supposed to control
whether the TTY is reset or not. Previously that setting would only
enable a second resetting of the TTY, which is of course pointless...
Hence, move the implicit resetting out of acquire_terminal() and make
the callers do it if they need it.
When starting a transient service, allow setting stdin/stdout/stderr fds
for it, by passing them in via the bus.
This also simplifies some of the serialization code for units.
This adds support for caching harddisk passwords in the kernel keyring
if it is available, thus supporting caching without Plymouth being
around.
This is also useful for hooking up "gdm-auto-login" with the collected
boot-time harddisk password, in order to support gnome keyring
passphrase unlocking via the HDD password, if it is the same.
Any passwords added to the kernel keyring this way have a timeout of
2.5min at which time they are purged from the kernel.
This adds support for naming file descriptors passed using socket
activation. The names are passed in a new $LISTEN_FDNAMES= environment
variable, that matches the existign $LISTEN_FDS= one and contains a
colon-separated list of names.
This also adds support for naming fds submitted to the per-service fd
store using FDNAME= in the sd_notify() message.
This also adds a new FileDescriptorName= setting for socket unit files
to set the name for fds created by socket units.
This also adds a new call sd_listen_fds_with_names(), that is similar to
sd_listen_fds(), but also returns the names of the fds.
systemd-activate gained the new --fdname= switch to specify a name for
testing socket activation.
This is based on #1247 by Maciej Wereski.
Fixes#1247.
If set to ~ the working directory is set to the home directory of the
user configured in User=.
This change also exposes the existing switch for the working directory
that allowed making missing working directories non-fatal.
This also changes "machinectl shell" to make use of this to ensure that
the invoked shell is by default in the user's home directory.
Fixes#1268.
This cleans up exec_child() function by moving mac_smack_apply_pid()
and setup_pam() to the same condition block, since both of them have
the same condition (i.e params->apply_permissions). It improves
readability without changing its operation.
When 'SmackProcessLabel=' is used in user@.service file, all processes
launched in systemd user session should be labeled as the designated name
of 'SmackProcessLabel' directive. However, if systemd has its own smack
label using '--with-smack-run-label' configuration, '(sd-pam)' is
labeled as the specific name of '--with-smack-run-label'. If
'SmackProcessLabel=' is used in user@.service file without
'--with-smack-run-label' configuration, (sd-pam) is labeled as "_" since
systemd (i.e. pid=1) is labeled as "_".
This is mainly because setup_pam() function is called before applying
smack label to child process. This patch fixes it by calling setup_pam()
after setting the smack label.
If we spawn a unit with a non-empty 'PAMName=', we fork off a
child-process _inside_ the unit, known as '(sd-pam)', which watches the
session. It waits for the main-process to exit and then finishes it via
pam_close_session(3).
However, the '(sd-pam)' setup is highly asynchronous. There is no
guarantee that process gets spawned before we finish the unit setup.
Therefore, there might be a root-owned process inside of the cgroup of
the unit, thus causing cg_migrate() to error-out with EPERM.
This patch makes setup_pam() synchronous and waits for the '(sd-pam)'
setup to finish before continuing. This guarantees that setresuid(2) was
at least tried before we continue with the child setup of the real unit.
Note that if setresuid(2) fails, we already warn loudly about it. You
really must make sure that you own the passed user if using 'PAMName='.
It seems very plausible to rely on that assumption.
When Group is set in the unit, the runtime directories are owned by
this group and not the default group of the user (same for cgroup paths
and standard outputs)
Fix#1231
Turns this:
r = -errno;
log_error_errno(errno, "foo");
into this:
r = log_error_errno(errno, "foo");
and this:
r = log_error_errno(errno, "foo");
return r;
into this:
return log_error_errno(errno, "foo");
When generating utmp/wtmp entries, optionally add both LOGIN_PROCESS and
INIT_PROCESS entries or even all three of LOGIN_PROCESS, INIT_PROCESS
and USER_PROCESS entries, instead of just a single INIT_PROCESS entry.
With this change systemd may be used to not only invoke a getty directly
in a SysV-compliant way but alternatively also a login(1) implementation
or even forego getty and login entirely, and invoke arbitrary shells in
a way that they appear in who(1) or w(1).
This is preparation for a later commit that adds a "machinectl shell"
operation to invoke a shell in a container, in a way that is compatible
with who(1) and w(1).
If a service has both ExecStart= and ExecStartPost= set with
Type=simple, then it might happen that we have two children create the
runtime directory of a service (as configured with RuntimeDirectory=) at
the same time. Previously we did this with mkdir_safe() which will
create the dir only if it is missing, but if it already exists will at
least verify the access mode and ownership to match the right values.
This is problematic in this case, since it creates and then adjusts the
settings, thus it might happen that one child creates the directory with
root owner, another one then verifies it, and only afterwards the
directory ownership is fixed by the original child, while the second
child already failed.
With this change we'll now always adjust the access mode, so that we
know that it is right. In the worst case this means we adjust the
mode/ownership even though its unnecessary, but this should have no
negative effect.
https://bugzilla.redhat.com/show_bug.cgi?id=1226509
When command path has access label and no SmackProcessLabel= is not
set, default process label will be set. But if the default process
label has no rule for the access label of the command path then smack
access error will be occurred.
So, if the command path has execute label then the child have to set
its label to the same of execute label of command path instead of
default process label.
The latest consolidation cleanup of write_string_file() revealed some users
of that helper which should have used write_string_file_no_create() in the
past but didn't. Basically, all existing users that write to files in /sys
and /proc should not expect to write to a file which is not yet existant.
Merge write_string_file(), write_string_file_no_create() and
write_string_file_atomic() into write_string_file() and provide a flags mask
that allows combinations of atomic writing, newline appending and automatic
file creation. Change all users accordingly.
Similar to SmackProcessLabel=, if this configuration is set, systemd
executes processes with given SMACK label. If unit has
SmackProcessLabel=, this config is overwritten.
But, do NOT be confused with SMACK64EXEC of execute file. This default
execute process label(and also label which is set by
SmackProcessLabel=) is set fork-ed process SMACK subject label and
used to access the execute file.
If the execution file has also SMACK64EXEC, finally executed process
has SMACK64EXEC subject.
While if the execution file has no SMACK64EXEC, the executed process
has label of this config(or label which is set by
SmackProcessLabel=). Because if execution file has no SMACK64EXEC then
excuted process inherits label from caller process(in this case, the
caller is systemd).
./configure --enable/disable-kdbus can be used to set the default
behavior regarding kdbus.
If no kdbus kernel support is available, dbus-dameon will be used.
With --enable-kdbus, the kernel command line option "kdbus=0" can
be used to disable kdbus.
With --disable-kdbus, the kernel command line option "kdbus=1" is
required to enable kdbus support.
Commit 72c0a2c25 ("everywhere: port everything to sigprocmask_many()
and friends") reworked code tree-wide to use the new sigprocmask_many()
helper. In this, it caused a regression in pam_setup, because it
dropped a line to initialize the 'ss' signal mask which is later used
in sigwait().
While at it, move the variable declaration to an inner scope.
This ports a lot of manual code over to sigprocmask_many() and friends.
Also, we now consistly check for sigprocmask() failures with
assert_se(), since the call cannot realistically fail unless there's a
programming error.
Also encloses a few sd_event_add_signal() calls with (void) when we
ignore the return values for it knowingly.
Also, when the child is potentially long-running make sure to set a
death signal.
Also, ignore the result of the reset operations explicitly by casting
them to (void).
When a service is chrooted with the option RootDirectory=/opt/..., then
the options PrivateDevices, PrivateTmp, ProtectHome, ProtectSystem must
mount the directories under $RootDirectory/{dev,tmp,home,usr,boot}.
The test-ns tool can test setup_namespace() with and without chroot:
$ sudo TEST_NS_PROJECTS=/home/lennart/projects ./test-ns
$ sudo TEST_NS_CHROOT=/home/alban/debian-tree TEST_NS_PROJECTS=/home/alban/debian-tree/home/alban/Documents ./test-ns
This changes log_unit_info() (and friends) to take a real Unit* object
insted of just a unit name as parameter. The call will now prefix all
logged messages with the unit name, thus allowing the unit name to be
dropped from the various passed romat strings, simplifying invocations
drastically, and unifying log output across messages. Also, UNIT= vs.
USER_UNIT= is now derived from the Manager object attached to the Unit
object, instead of getpid(). This has the benefit of correcting the
field for --test runs.
Also contains a couple of other logging improvements:
- Drops a couple of strerror() invocations in favour of using %m.
- Not only .mount units now warn if a symlinks exist for the mount
point already, .automount units do that too, now.
- A few invocations of log_struct() that didn't actually pass any
additional structured data have been replaced by simpler invocations
of log_unit_info() and friends.
- For structured data a new LOG_UNIT_MESSAGE() macro has been added,
that works like LOG_MESSAGE() but prefixes the message with the unit
name. Similar, there's now LOG_LINK_MESSAGE() and
LOG_NETDEV_MESSAGE().
- For structured data new LOG_UNIT_ID(), LOG_LINK_INTERFACE(),
LOG_NETDEV_INTERFACE() macros have been added that generate the
necessary per object fields. The old log_unit_struct() call has been
removed in favour of these new macros used in raw log_struct()
invocations. In addition to removing one more function call this
allows generated structured log messages that contain two object
fields, as necessary for example for network interfaces that are
joined into another network interface, and whose messages shall be
indexed by both.
- The LOG_ERRNO() macro has been removed, in favour of
log_struct_errno(). The latter has the benefit of ensuring that %m in
format strings is properly resolved to the specified error number.
- A number of logging messages have been converted to use
log_unit_info() instead of log_info()
- The client code in sysv-generator no longer #includes core code from
src/core/.
- log_unit_full_errno() has been removed, log_unit_full() instead takes
an errno now, too.
- log_unit_info(), log_link_info(), log_netdev_info() and friends, now
avoid double evaluation of their parameters