Commit graph

312 commits

Author SHA1 Message Date
Franck Bui eedf223a30 core: add 'i' in confirm spawn to give a short summary of the unit to spawn 2016-11-17 18:16:50 +01:00
Franck Bui d172b175f6 core: rework the confirmation spawn prompt
Previously it was "[Yes, Fail, Skip]" which is pretty misleading because it
suggests that the whole word needs to be entered instead of a single char.

Also this won't fit well when we'll extend the number of choices.

This patch addresses this by changing the choice hint with "[y, f, s – h for help]"
so it's now clear that a single letter has to be entered.

It also introduces a new choice 'h' which describes all possible choices since
a single letter can be not descriptive enough for new users.

It also allow to stick with the same hint string regardless of how
many choices we will support.
2016-11-17 18:16:50 +01:00
Franck Bui 2bcd3c26fe core: limit the length of the confirmation question
When "confirmation_spawn=1", the confirmation question can look like:

  Execute /usr/bin/kmod static-nodes --format=tmpfiles --output=/run/tmpfiles.d/kmod.conf? [Yes, No, Skip]

which is pretty verbose and might not fit in the console width size (which is
usually 80 chars) and thus question will be splitted into 2 consecutive lines.

However since the question is now refreshed every 2 secs, the reprinted
question will overwrite the second line of the previous one...

To prevent this, this patch makes sure that the command line won't be longer
than 60 chars by ellipsizing it if the command is longer:

  Execute /usr/bin/kmod static-nodes --format=tmpfiles --output=/ru…nf? [Yes, No, View, Skip]

A following patch will introduce a new choice that will allow the user to get
details on the command to be executed so it will still be possible to see the
full command line.
2016-11-17 18:16:50 +01:00
Franck Bui 2bcc330942 core: in confirm_spawn, the meaning of 'n' and 's' choices are confusing
Before this patch we had:

 - "no" which gives "failing execution" but the command is actually assumed as
   succeed.

 - "skip" which gives "skipping", but the command is assumed to have failed,
   which ends up with "Failed to start ..." on the console.

Now we have:

 - "fail" which gives "failing execution" and the command is indeed assumed as
   failed.

 - "skip" which gives "skipping execution" and the command is assumed as
   succeed.
2016-11-17 18:16:49 +01:00
Franck Bui 3b20f877ad core: rework ask_for_confirmation()
Now the reponses are handled by ask_for_confirmation() as well as the report of
any errors occuring during the process of retrieving the confirmation response.

One benefit of this is that there's no need to open/close the console one more
time when reporting error/status messages.

The caller now just needs to care about the return values whose meanings are:

 - don't execute and pretend that the command failed
 - don't execute and pretend that the command succeeed
 - positive answer, execute the command

Also some slight code reorganization and introduce write_confirm_error() and
write_confirm_error_fd(). write_confim_message becomes unneeded.
2016-11-17 18:16:49 +01:00
Franck Bui 7d5ceb6416 core: allow to redirect confirmation messages to a different console
It's rather hard to parse the confirmation messages (enabled with
systemd.confirm_spawn=true) amongst the status messages and the kernel
ones (if enabled).

This patch gives the possibility to the user to redirect the confirmation
message to a different virtual console, either by giving its name or its path,
so those messages are separated from the other ones and easier to read.
2016-11-17 18:16:16 +01:00
Djalal Harouni af964954c6 core: on DynamicUser= make sure that protecting sensitive paths is enforced (#4596)
This adds a variable that is always set to false to make sure that
protect paths inside sandbox are always enforced and not ignored. The only
case when it is set to true is on DynamicUser=no and RootDirectory=/chroot
is set. This allows users to use more our sandbox features inside RootDirectory=

The only exception is ProtectSystem=full|strict and when DynamicUser=yes
is implied. Currently RootDirectory= is not fully compatible with these
due to two reasons:

* /chroot/usr|etc has to be present on ProtectSystem=full
* /chroot// has to be a mount point on ProtectSystem=strict.
2016-11-08 21:57:32 -05:00
Zbigniew Jędrzejewski-Szmek d85a0f8028 Merge pull request #4536 from poettering/seccomp-namespaces
core: add new RestrictNamespaces= unit file setting

Merging, not rebasing, because this touches many files and there were tree-wide cleanups in the mean time.
2016-11-08 19:54:21 -05:00
Zbigniew Jędrzejewski-Szmek f97b34a629 Rename formats-util.h to format-util.h
We don't have plural in the name of any other -util files and this
inconsistency trips me up every time I try to type this file name
from memory. "formats-util" is even hard to pronounce.
2016-11-07 10:15:08 -05:00
Lennart Poettering add005357d core: add new RestrictNamespaces= unit file setting
This new setting permits restricting whether namespaces may be created and
managed by processes started by a unit. It installs a seccomp filter blocking
certain invocations of unshare(), clone() and setns().

RestrictNamespaces=no is the default, and does not restrict namespaces in any
way. RestrictNamespaces=yes takes away the ability to create or manage any kind
of namspace. "RestrictNamespaces=mnt ipc" restricts the creation of namespaces
so that only mount and IPC namespaces may be created/managed, but no other
kind of namespaces.

This setting should be improve security quite a bit as in particular user
namespacing was a major source of CVEs in the kernel in the past, and is
accessible to unprivileged processes. With this setting the entire attack
surface may be removed for system services that do not make use of namespaces.
2016-11-04 07:40:13 -06:00
Lennart Poettering 493fd52f1a Merge pull request #4510 from keszybz/tree-wide-cleanups
Tree wide cleanups
2016-11-03 13:59:20 -06:00
Djalal Harouni cdc5d5c55e core: intialize user aux groups and SupplementaryGroups= when DynamicUser= is set
Make sure that when DynamicUser= is set that we intialize the user
supplementary groups and that we also support SupplementaryGroups=

Fixes: https://github.com/systemd/systemd/issues/4539

Thanks Evgeny Vereshchagin (@evverx)
2016-11-03 08:36:53 +01:00
Lennart Poettering 32e134c19f Merge pull request #4483 from poettering/exec-order
more seccomp fixes, and change of order of selinux/aa/smack and seccomp application on exec
2016-11-02 16:09:59 -06:00
Djalal Harouni bbeea27117 core: initialize groups list before checking SupplementaryGroups= of a unit (#4533)
Always initialize the supplementary groups of caller before checking the
unit SupplementaryGroups= option.

Fixes https://github.com/systemd/systemd/issues/4531
2016-11-02 10:51:35 -06:00
Lennart Poettering 5cd9cd3537 execute: apply seccomp filters after changing selinux/aa/smack contexts
Seccomp is generally an unprivileged operation, changing security contexts is
most likely associated with some form of policy. Moreover, while seccomp may
influence our own flow of code quite a bit (much more than the security context
change) make sure to apply the seccomp filters immediately before executing the
binary to invoke.

This also moves enforcement of NNP after the security context change, so that
NNP cannot affect it anymore. (However, the security policy now has to permit
the NNP change).

This change has a good chance of breaking current SELinux/AA/SMACK setups, because
the policy might not expect this change of behaviour. However, it's technically
the better choice I think and should hence be applied.

Fixes: #3993
2016-11-02 08:55:00 -06:00
Djalal Harouni fa1f250d6f Merge pull request #4495 from topimiettinen/block-shmat-exec
seccomp: also block shmat(..., SHM_EXEC) for MemoryDenyWriteExecute
2016-10-28 15:41:07 +02:00
Djalal Harouni 59e856c7d3 core: make unit argument const for apply seccomp functions 2016-10-27 09:40:22 +02:00
Djalal Harouni 50b3dfb9d6 core: lets apply working directory just after mount namespaces
This makes applying groups after applying the working directory, this
may allow some flexibility but at same it is not a big deal since we
don't execute or do anything between applying working directory and
droping groups.
2016-10-27 09:40:21 +02:00
Djalal Harouni 2b3c1b9e9d core: get the working directory value inside apply_working_directory()
Improve apply_working_directory() and lets get the current working directory
inside of it.
2016-10-27 09:40:21 +02:00
Djalal Harouni e7f1e7c6e2 core: move apply working directory code into its own apply_working_directory() 2016-10-27 09:40:21 +02:00
Djalal Harouni 93c6bb51b6 core: move the code that setups namespaces on its own function 2016-10-27 09:40:21 +02:00
Topi Miettinen d2ffa389b8 seccomp: also block shmat(..., SHM_EXEC) for MemoryDenyWriteExecute
shmat(..., SHM_EXEC) can be used to create writable and executable
memory, so let's block it when MemoryDenyWriteExecute is set.
2016-10-26 18:59:14 +03:00
Lennart Poettering a3be2849b2 seccomp: add new helper call seccomp_load_filter_set()
This allows us to unify most of the code in apply_protect_kernel_modules() and
apply_private_devices().
2016-10-24 17:32:50 +02:00
Lennart Poettering 8d7b0c8fd7 seccomp: add new seccomp_init_conservative() helper
This adds a new seccomp_init_conservative() helper call that is mostly just a
wrapper around seccomp_init(), but turns off NNP and adds in all secondary
archs, for best compatibility with everything else.

Pretty much all of our code used the very same constructs for these three
steps, hence unifying this in one small function makes things a lot shorter.

This also changes incorrect usage of the "scmp_filter_ctx" type at various
places. libseccomp defines it as typedef to "void*", i.e. it is a pointer type
(pretty poor choice already!) that casts implicitly to and from all other
pointer types (even poorer choice: you defined a confusing type now, and don't
even gain any bit of type safety through it...). A lot of the code assumed the
type would refer to a structure, and hence aded additional "*" here and there.
Remove that.
2016-10-24 17:32:50 +02:00
Lennart Poettering 25a8d8a0cb core: rework apply_protect_kernel_modules() to use seccomp_add_syscall_filter_set()
Let's simplify this call, by making use of the new infrastructure.

This is actually more in line with Djalal's original patch but instead of
search the filter set in the array by its name we can now use the set index and
jump directly to it.
2016-10-24 17:32:50 +02:00
Lennart Poettering 8130926d32 core: rework syscall filter set handling
A variety of fixes:

- rename the SystemCallFilterSet structure to SyscallFilterSet. So far the main
  instance of it (the syscall_filter_sets[] array) used to abbreviate
  "SystemCall" as "Syscall". Let's stick to one of the two syntaxes, and not
  mix and match too wildly. Let's pick the shorter name in this case, as it is
  sufficiently well established to not confuse hackers reading this.

- Export explicit indexes into the syscall_filter_sets[] array via an enum.
  This way, code that wants to make use of a specific filter set, can index it
  directly via the enum, instead of having to search for it. This makes
  apply_private_devices() in particular a lot simpler.

- Provide two new helper calls in seccomp-util.c: syscall_filter_set_find() to
  find a set by its name, seccomp_add_syscall_filter_set() to add a set to a
  seccomp object.

- Update SystemCallFilter= parser to use extract_first_word().  Let's work on
  deprecating FOREACH_WORD_QUOTED().

- Simplify apply_private_devices() using this functionality
2016-10-24 17:32:50 +02:00
Lennart Poettering e0f3720e39 core: move misplaced comment to the right place 2016-10-24 17:29:51 +02:00
Lennart Poettering f673b62df6 core: simplify skip_seccomp_unavailable() a bit
Let's prefer early-exit over deep-indented if blocks. Not behavioural change.
2016-10-24 17:29:51 +02:00
Djalal Harouni 366ddd252e core: do not assert when sysconf(_SC_NGROUPS_MAX) fails (#4466)
Remove the assert and check the return code of sysconf(_SC_NGROUPS_MAX).

_SC_NGROUPS_MAX maps to NGROUPS_MAX which is defined in <limits.h> to
65536 these days. The value is a sysctl read-only
/proc/sys/kernel/ngroups_max and the kernel assumes that it is always
positive otherwise things may break. Follow this and support only
positive values for all other case return either -errno or -EOPNOTSUPP.

Now if there are systems that want to re-write NGROUPS_MAX then they
should not pass SupplementaryGroups= in units even if it is empty, in
this case nothing fails and we just ignore supplementary groups. However
if SupplementaryGroups= is passed even if it is empty we have to assume
that there will be groups manipulation from our side or the kernel and
since the kernel always assumes that NGROUPS_MAX is positive, then
follow that and support only positive values.
2016-10-24 13:13:06 +02:00
Djalal Harouni 8b6903ad4d core: lets move the setup of working directory before group enforce
This is minor but lets try to split and move bit by bit cgroups and
portable environment setup before applying the security context.
2016-10-23 23:27:20 +02:00
Djalal Harouni 4d885bd326 core: first lookup and cache creds then apply them after namespace setup
This fixes: https://github.com/systemd/systemd/issues/4357

Let's lookup and cache creds then apply them. We also switch from
getgroups() to getgrouplist().
2016-10-23 23:24:14 +02:00
Zbigniew Jędrzejewski-Szmek 605405c6cc tree-wide: drop NULL sentinel from strjoin
This makes strjoin and strjoina more similar and avoids the useless final
argument.

spatch -I . -I ./src -I ./src/basic -I ./src/basic -I ./src/shared -I ./src/shared -I ./src/network -I ./src/locale -I ./src/login -I ./src/journal -I ./src/journal -I ./src/timedate -I ./src/timesync -I ./src/nspawn -I ./src/resolve -I ./src/resolve -I ./src/systemd -I ./src/core -I ./src/core -I ./src/libudev -I ./src/udev -I ./src/udev/net -I ./src/udev -I ./src/libsystemd/sd-bus -I ./src/libsystemd/sd-event -I ./src/libsystemd/sd-login -I ./src/libsystemd/sd-netlink -I ./src/libsystemd/sd-network -I ./src/libsystemd/sd-hwdb -I ./src/libsystemd/sd-device -I ./src/libsystemd/sd-id128 -I ./src/libsystemd-network --sp-file coccinelle/strjoin.cocci --in-place $(git ls-files src/*.c)

git grep -e '\bstrjoin\b.*NULL' -l|xargs sed -i -r 's/strjoin\((.*), NULL\)/strjoin(\1)/'

This might have missed a few cases (spatch has a really hard time dealing
with _cleanup_ macros), but that's no big issue, they can always be fixed
later.
2016-10-23 11:43:27 -04:00
Luca Bruno 52c239d770 core/exec: add a named-descriptor option ("fd") for streams (#4179)
This commit adds a `fd` option to `StandardInput=`,
`StandardOutput=` and `StandardError=` properties in order to
connect standard streams to externally named descriptors provided
by some socket units.

This option looks for a file descriptor named as the corresponding
stream. Custom names can be specified, separated by a colon.
If multiple name-matches exist, the first matching fd will be used.
2016-10-17 20:05:49 -04:00
Zbigniew Jędrzejewski-Szmek 6b430fdb7c tree-wide: use mfree more 2016-10-16 23:35:39 -04:00
Djalal Harouni e66a2f658b core: make sure to dump ProtectKernelModules= value 2016-10-12 14:12:17 +02:00
Djalal Harouni 4084e8fc89 core: check protect_kernel_modules and private_devices in order to setup NNP 2016-10-12 14:12:07 +02:00
Djalal Harouni c575770b75 core:sandbox: lets make /lib/modules/ inaccessible on ProtectKernelModules=
Lets go further and make /lib/modules/ inaccessible for services that do
not have business with modules, this is a minor improvment but it may
help on setups with custom modules and they are limited... in regard of
kernel auto-load feature.

This change introduce NameSpaceInfo struct which we may embed later
inside ExecContext but for now lets just reduce the argument number to
setup_namespace() and merge ProtectKernelModules feature.
2016-10-12 14:11:16 +02:00
Djalal Harouni 502d704e5e core:sandbox: Add ProtectKernelModules= option
This is useful to turn off explicit module load and unload operations on modular
kernels. This option removes CAP_SYS_MODULE from the capability bounding set for
the unit, and installs a system call filter to block module system calls.

This option will not prevent the kernel from loading modules using the module
auto-load feature which is a system wide operation.
2016-10-12 13:31:21 +02:00
Lennart Poettering e0d2adfde6 core: chown() any TTY used for stdin, not just when StandardInput=tty is used (#4347)
If stdin is supplied as an fd for transient units (using the
StandardInputFileDescriptor pseudo-property for transient units), then we
should also fix up the TTY ownership, not just when we opened the TTY
ourselves.

This simply drops the explicit is_terminal_input()-based check. Note that
chown_terminal() internally does a much more appropriate isatty()-based check
anyway, hence we can drop this without replacement.

Fixes: #4260
2016-10-11 14:07:22 -04:00
Lennart Poettering 4b58153dd2 core: add "invocation ID" concept to service manager
This adds a new invocation ID concept to the service manager. The invocation ID
identifies each runtime cycle of a unit uniquely. A new randomized 128bit ID is
generated each time a unit moves from and inactive to an activating or active
state.

The primary usecase for this concept is to connect the runtime data PID 1
maintains about a service with the offline data the journal stores about it.
Previously we'd use the unit name plus start/stop times, which however is
highly racy since the journal will generally process log data after the service
already ended.

The "invocation ID" kinda matches the "boot ID" concept of the Linux kernel,
except that it applies to an individual unit instead of the whole system.

The invocation ID is passed to the activated processes as environment variable.
It is additionally stored as extended attribute on the cgroup of the unit. The
latter is used by journald to automatically retrieve it for each log logged
message and attach it to the log entry. The environment variable is very easily
accessible, even for unprivileged services. OTOH the extended attribute is only
accessible to privileged processes (this is because cgroupfs only supports the
"trusted." xattr namespace, not "user."). The environment variable may be
altered by services, the extended attribute may not be, hence is the better
choice for the journal.

Note that reading the invocation ID off the extended attribute from journald is
racy, similar to the way reading the unit name for a logging process is.

This patch adds APIs to read the invocation ID to sd-id128:
sd_id128_get_invocation() may be used in a similar fashion to
sd_id128_get_boot().

PID1's own logging is updated to always include the invocation ID when it logs
information about a unit.

A new bus call GetUnitByInvocationID() is added that allows retrieving a bus
path to a unit by its invocation ID. The bus path is built using the invocation
ID, thus providing a path for referring to a unit that is valid only for the
current runtime cycleof it.

Outlook for the future: should the kernel eventually allow passing of cgroup
information along AF_UNIX/SOCK_DGRAM messages via a unique cgroup id, then we
can alter the invocation ID to be generated as hash from that rather than
entirely randomly. This way we can derive the invocation race-freely from the
messages.
2016-10-07 20:14:38 +02:00
Lennart Poettering 97f0e76f18 user-util: rework maybe_setgroups() a bit
Let's drop the caching of the setgroups /proc field for now. While there's a
strict regime in place when it changes states, let's better not cache it since
we cannot really be sure we follow that regime correctly.

More importantly however, this is not in performance sensitive code, and
there's no indication the cache is really beneficial, hence let's drop the
caching and make things a bit simpler.

Also, while we are at it, rework the error handling a bit, and always return
negative errno-style error codes, following our usual coding style. This has
the benefit that we can sensible hanld read_one_line_file() errors, without
having to updat errno explicitly.
2016-10-06 19:04:10 +02:00
Lennart Poettering 2d6fce8d7c core: leave PAM stub process around with GIDs updated
In the process execution code of PID 1, before
096424d123 the GID settings where changed before
invoking PAM, and the UID settings after. After the change both changes are
made after the PAM session hooks are run. When invoking PAM we fork once, and
leave a stub process around which will invoke the PAM session end hooks when
the session goes away. This code previously was dropping the remaining privs
(which were precisely the UID). Fix this code to do this correctly again, by
really dropping them else (i.e. the GID as well).

While we are at it, also fix error logging of this code.

Fixes: #4238
2016-10-06 19:04:10 +02:00
Giuseppe Scrivano 36d854780c core: do not fail in a container if we can't use setgroups
It might be blocked through /proc/PID/setgroups
2016-10-06 11:49:00 +02:00
Stefan Schweter 629ff674ac tree-wide: remove consecutive duplicate words in comments 2016-10-04 17:06:25 +02:00
Djalal Harouni 8f81a5f61b core: Use @raw-io syscall group to filter I/O syscalls when PrivateDevices= is set
Instead of having a local syscall list, use the @raw-io group which
contains the same set of syscalls to filter.
2016-09-25 12:52:27 +02:00
Lennart Poettering cefc33aee2 execute: move SMACK setup code into its own function
While we are at it, move PAM code #ifdeffery into setup_pam() to simplify the
main execution logic a bit.
2016-09-25 10:52:57 +02:00
Lennart Poettering ba128bb809 execute: filter low-level I/O syscalls if PrivateDevices= is set
If device access is restricted via PrivateDevices=, let's also block the
various low-level I/O syscalls at the same time, so that we know that the
minimal set of devices in our virtualized /dev are really everything the unit
can access.
2016-09-25 10:52:57 +02:00
Lennart Poettering 096424d123 execute: drop group priviliges only after setting up namespace
If PrivateDevices=yes is set, the namespace code creates device nodes in /dev
that should be owned by the host's root, hence let's make sure we set up the
namespace before dropping group privileges.
2016-09-25 10:42:18 +02:00
Lennart Poettering 3fbe8dbe41 execute: if RuntimeDirectory= is set, it should be writable
Implicitly make all dirs set with RuntimeDirectory= writable, as the concept
otherwise makes no sense.
2016-09-25 10:19:05 +02:00
Lennart Poettering be39ccf3a0 execute: move suppression of HOME=/ and SHELL=/bin/nologin into user-util.c
This adds a new call get_user_creds_clean(), which is just like
get_user_creds() but returns NULL in the home/shell parameters if they contain
no useful information. This code previously lived in execute.c, but by
generalizing this we can reuse it in run.c.
2016-09-25 10:18:57 +02:00