Commit graph

236 commits

Author SHA1 Message Date
Lennart Poettering d2d6c096f6 core: add ability to define arbitrary bind mounts for services
This adds two new settings BindPaths= and BindReadOnlyPaths=. They allow
defining arbitrary bind mounts specific to particular services. This is
particularly useful for services with RootDirectory= set as this permits making
specific bits of the host directory available to chrooted services.

The two new settings follow the concepts nspawn already possess in --bind= and
--bind-ro=, as well as the .nspawn settings Bind= and BindReadOnly= (and these
latter options should probably be renamed to BindPaths= and BindReadOnlyPaths=
too).

Fixes: #3439
2016-12-14 00:54:10 +01:00
Jouke Witteveen a4e26faf33 man: fix $SERVICE_RESULT/$EXIT_CODE/$EXIT_STATUS documentation
Note that any exit code is available through $EXIT_STATUS and not through
$EXIT_CODE. This mimics siginfo.
2016-12-06 13:37:14 +01:00
Jouke Witteveen 7ed0a4c537 bus-util: add protocol error type explanation 2016-11-29 23:19:52 +01:00
Jouke Witteveen e0c7d5f7be man: document protocol error type for service failures (#4724) 2016-11-23 22:51:33 +01:00
Lennart Poettering 1a1b13c957 seccomp: add @filesystem syscall group (#4537)
@filesystem groups various file system operations, such as opening files and
directories for read/write and stat()ing them, plus renaming, deleting,
symlinking, hardlinking.
2016-11-21 19:29:12 -05:00
Lennart Poettering 5327c910d2 namespace: simplify, optimize and extend handling of mounts for namespace
This changes a couple of things in the namespace handling:

It merges the BindMount and TargetMount structures. They are mostly the same,
hence let's just use the same structue, and rely on C's implicit zero
initialization of partially initialized structures for the unneeded fields.

This reworks memory management of each entry a bit. It now contains one "const"
and one "malloc" path. We use the former whenever we can, but use the latter
when we have to, which is the case when we have to chase symlinks or prefix a
root directory. This means in the common case we don't actually need to
allocate any dynamic memory. To make this easy to use we add an accessor
function bind_mount_path() which retrieves the right path string from a
BindMount structure.

While we are at it, also permit "+" as prefix for dirs configured with
ReadOnlyPaths= and friends: if specified the root directory of the unit is
implicited prefixed.

This also drops set_bind_mount() and uses C99 structure initialization instead,
which I think is more readable and clarifies what is being done.

This drops append_protect_kernel_tunables() and
append_protect_kernel_modules() as append_static_mounts() is now simple enough
to be called directly.

Prefixing with the root dir is now done in an explicit step in
prefix_where_needed(). It will prepend the root directory on each entry that
doesn't have it prefixed yet. The latter is determined depending on an extra
bit in the BindMount structure.
2016-11-17 18:08:32 +01:00
Djalal Harouni 8526555680 doc: move ProtectKernelModules= documentation near ProtectKernelTunalbes= 2016-11-15 15:04:41 +01:00
Djalal Harouni a7db8614f3 doc: note when no new privileges is implied 2016-11-15 15:04:35 +01:00
Lennart Poettering add005357d core: add new RestrictNamespaces= unit file setting
This new setting permits restricting whether namespaces may be created and
managed by processes started by a unit. It installs a seccomp filter blocking
certain invocations of unshare(), clone() and setns().

RestrictNamespaces=no is the default, and does not restrict namespaces in any
way. RestrictNamespaces=yes takes away the ability to create or manage any kind
of namspace. "RestrictNamespaces=mnt ipc" restricts the creation of namespaces
so that only mount and IPC namespaces may be created/managed, but no other
kind of namespaces.

This setting should be improve security quite a bit as in particular user
namespacing was a major source of CVEs in the kernel in the past, and is
accessible to unprivileged processes. With this setting the entire attack
surface may be removed for system services that do not make use of namespaces.
2016-11-04 07:40:13 -06:00
Zbigniew Jędrzejewski-Szmek cf88547034 Merge pull request #4548 from keszybz/seccomp-help
systemd-analyze syscall-filter
2016-11-03 20:27:45 -04:00
Kees Cook d974f949f1 doc: clarify NoNewPrivileges (#4562)
Setting no_new_privs does not stop UID changes, but rather blocks
gaining privileges through execve(). Also fixes a small typo.
2016-11-03 20:26:59 -04:00
Zbigniew Jędrzejewski-Szmek d5efc18b60 seccomp-util, analyze: export comments as a help string
Just to make the whole thing easier for users.
2016-11-03 09:35:36 -04:00
Zbigniew Jędrzejewski-Szmek 869feb3388 analyze: add syscall-filter verb
This should make it easier for users to understand what each filter
means as the list of syscalls is updated in subsequent systemd versions.
2016-11-03 09:35:35 -04:00
Lennart Poettering 2ca8dc15f9 man: document that too strict system call filters may affect the service manager
If execve() or socket() is filtered the service manager might get into trouble
executing the service binary, or handling any failures when this fails. Mention
this in the documentation.

The other option would be to implicitly whitelist all system calls that are
required for these codepaths. However, that appears less than desirable as this
would mean socket() and many related calls have to be whitelisted
unconditionally. As writing system call filters requires a certain level of
expertise anyway it sounds like the better option to simply document these
issues and suggest that the user disables system call filters in the service
temporarily in order to debug any such failures.

See: #3993.
2016-11-02 08:55:24 -06:00
Lennart Poettering 133ddbbeae seccomp: add two new syscall groups
@resources contains various syscalls that alter resource limits and memory and
scheduling parameters of processes. As such they are good candidates to block
for most services.

@basic-io contains a number of basic syscalls for I/O, similar to the list
seccomp v1 permitted but slightly more complete. It should be useful for
building basic whitelisting for minimal sandboxes
2016-11-02 08:50:00 -06:00
Lennart Poettering aa6b9cec88 man: two minor fixes 2016-11-02 08:50:00 -06:00
Lennart Poettering cd5bfd7e60 seccomp: include pipes and memfd in @ipc
These system calls clearly fall in the @ipc category, hence should be listed
there, simply to avoid confusion and surprise by the user.
2016-11-02 08:50:00 -06:00
Lennart Poettering a8c157ff30 seccomp: drop execve() from @process list
The system call is already part in @default hence implicitly allowed anyway.
Also, if it is actually blocked then systemd couldn't execute the service in
question anymore, since the application of seccomp is immediately followed by
it.
2016-11-02 08:49:59 -06:00
Lennart Poettering c79aff9a82 seccomp: add clock query and sleeping syscalls to "@default" group
Timing and sleep are so basic operations, it makes very little sense to ever
block them, hence don't.
2016-11-02 08:49:59 -06:00
Zbigniew Jędrzejewski-Szmek aa34055ffb seccomp: allow specifying arm64, mips, ppc (#4491)
"Secondary arch" table for mips is entirely speculative…
2016-11-01 09:33:18 -06:00
Jakub Wilk b17649ee5e man: fix typos (#4527) 2016-10-31 08:08:08 -04:00
Djalal Harouni fa1f250d6f Merge pull request #4495 from topimiettinen/block-shmat-exec
seccomp: also block shmat(..., SHM_EXEC) for MemoryDenyWriteExecute
2016-10-28 15:41:07 +02:00
Topi Miettinen d2ffa389b8 seccomp: also block shmat(..., SHM_EXEC) for MemoryDenyWriteExecute
shmat(..., SHM_EXEC) can be used to create writable and executable
memory, so let's block it when MemoryDenyWriteExecute is set.
2016-10-26 18:59:14 +03:00
Zbigniew Jędrzejewski-Szmek 74388c2d11 man: document the default value of NoNewPrivileges=
Fixes #4329.
2016-10-24 23:45:57 -04:00
Lennart Poettering 47da760efd man: document default for User=
Replaces: #4375
2016-10-20 13:21:25 +02:00
Luca Bruno 52c239d770 core/exec: add a named-descriptor option ("fd") for streams (#4179)
This commit adds a `fd` option to `StandardInput=`,
`StandardOutput=` and `StandardError=` properties in order to
connect standard streams to externally named descriptors provided
by some socket units.

This option looks for a file descriptor named as the corresponding
stream. Custom names can be specified, separated by a colon.
If multiple name-matches exist, the first matching fd will be used.
2016-10-17 20:05:49 -04:00
Lennart Poettering c7458f9399 man: avoid abbreviated "cgroups" terminology (#4396)
Let's avoid the overly abbreviated "cgroups" terminology. Let's instead write:

"Linux Control Groups (cgroups)" is the long form wherever the term is
introduced in prose. Use "control groups" in the short form wherever the term
is used within brief explanations.

Follow-up to: #4381
2016-10-17 09:50:26 -04:00
Zbigniew Jędrzejewski-Szmek 74b47bbd5d man: add crosslink between systemd.resource-control(5) and systemd.exec(5)
Fixes #4379.
2016-10-15 18:38:20 -04:00
Lennart Poettering 8bfdf29b24 Merge pull request #4243 from endocode/djalal/sandbox-first-protection-kernelmodules-v1
core:sandbox: Add ProtectKernelModules= and some fixes
2016-10-13 18:36:29 +02:00
Thomas Hindoe Paaboel Andersen 2dd678171e man: typo fixes
A mix of fixes for typos and UK english
2016-10-12 23:02:44 +02:00
Djalal Harouni c575770b75 core:sandbox: lets make /lib/modules/ inaccessible on ProtectKernelModules=
Lets go further and make /lib/modules/ inaccessible for services that do
not have business with modules, this is a minor improvment but it may
help on setups with custom modules and they are limited... in regard of
kernel auto-load feature.

This change introduce NameSpaceInfo struct which we may embed later
inside ExecContext but for now lets just reduce the argument number to
setup_namespace() and merge ProtectKernelModules feature.
2016-10-12 14:11:16 +02:00
Djalal Harouni ac246d9868 doc: minor hint about InaccessiblePaths= in regard of ProtectKernelTunables= 2016-10-12 13:52:40 +02:00
Djalal Harouni 2cd0a73547 core:sandbox: remove CAP_SYS_RAWIO on PrivateDevices=yes
The rawio system calls were filtered, but CAP_SYS_RAWIO allows to access raw
data through /proc, ioctl and some other exotic system calls...
2016-10-12 13:39:49 +02:00
Djalal Harouni 502d704e5e core:sandbox: Add ProtectKernelModules= option
This is useful to turn off explicit module load and unload operations on modular
kernels. This option removes CAP_SYS_MODULE from the capability bounding set for
the unit, and installs a system call filter to block module system calls.

This option will not prevent the kernel from loading modules using the module
auto-load feature which is a system wide operation.
2016-10-12 13:31:21 +02:00
Zbigniew Jędrzejewski-Szmek 56b4c80b42 Merge pull request #4348 from poettering/docfixes
Various smaller documentation fixes.
2016-10-11 13:49:15 -04:00
Lennart Poettering f4c9356d13 man: beef up documentation on per-unit resource limits a bit
Let's clarify that for user services some OS-defined limits bound the settings
in the unit files.

Fixes: #4232
2016-10-11 18:42:22 +02:00
Lennart Poettering 4b58153dd2 core: add "invocation ID" concept to service manager
This adds a new invocation ID concept to the service manager. The invocation ID
identifies each runtime cycle of a unit uniquely. A new randomized 128bit ID is
generated each time a unit moves from and inactive to an activating or active
state.

The primary usecase for this concept is to connect the runtime data PID 1
maintains about a service with the offline data the journal stores about it.
Previously we'd use the unit name plus start/stop times, which however is
highly racy since the journal will generally process log data after the service
already ended.

The "invocation ID" kinda matches the "boot ID" concept of the Linux kernel,
except that it applies to an individual unit instead of the whole system.

The invocation ID is passed to the activated processes as environment variable.
It is additionally stored as extended attribute on the cgroup of the unit. The
latter is used by journald to automatically retrieve it for each log logged
message and attach it to the log entry. The environment variable is very easily
accessible, even for unprivileged services. OTOH the extended attribute is only
accessible to privileged processes (this is because cgroupfs only supports the
"trusted." xattr namespace, not "user."). The environment variable may be
altered by services, the extended attribute may not be, hence is the better
choice for the journal.

Note that reading the invocation ID off the extended attribute from journald is
racy, similar to the way reading the unit name for a logging process is.

This patch adds APIs to read the invocation ID to sd-id128:
sd_id128_get_invocation() may be used in a similar fashion to
sd_id128_get_boot().

PID1's own logging is updated to always include the invocation ID when it logs
information about a unit.

A new bus call GetUnitByInvocationID() is added that allows retrieving a bus
path to a unit by its invocation ID. The bus path is built using the invocation
ID, thus providing a path for referring to a unit that is valid only for the
current runtime cycleof it.

Outlook for the future: should the kernel eventually allow passing of cgroup
information along AF_UNIX/SOCK_DGRAM messages via a unique cgroup id, then we
can alter the invocation ID to be generated as hash from that rather than
entirely randomly. This way we can derive the invocation race-freely from the
messages.
2016-10-07 20:14:38 +02:00
hbrueckner 6abfd30372 seccomp: add support for the s390 architecture (#4287)
Add seccomp support for the s390 architecture (31-bit and 64-bit)
to systemd.

This requires libseccomp >= 2.3.1.
2016-10-05 13:58:55 +02:00
Stefan Schweter cfaf4b75e0 man: remove consecutive duplicate words (#4268)
This PR removes consecutive duplicate words from the man pages of:

* `resolved.conf.xml`
* `systemd.exec.xml`
* `systemd.socket.xml`
2016-10-03 17:09:54 +02:00
Djalal Harouni 8f81a5f61b core: Use @raw-io syscall group to filter I/O syscalls when PrivateDevices= is set
Instead of having a local syscall list, use the @raw-io group which
contains the same set of syscalls to filter.
2016-09-25 12:52:27 +02:00
Djalal Harouni 49accde7bd core:sandbox: add more /proc/* entries to ProtectKernelTunables=
Make ALSA entries, latency interface, mtrr, apm/acpi, suspend interface,
filesystems configuration and IRQ tuning readonly.

Most of these interfaces now days should be in /sys but they are still
available through /proc, so just protect them. This patch does not touch
/proc/net/...
2016-09-25 11:30:11 +02:00
Djalal Harouni 9221aec8d0 doc: explicitly document that /dev/mem and /dev/port are blocked by PrivateDevices=true 2016-09-25 11:25:44 +02:00
Djalal Harouni e778185bb5 doc: documentation fixes for ReadWritePaths= and ProtectKernelTunables=
Documentation fixes for ReadWritePaths= and ProtectKernelTunables=
as reported by Evgeny Vereshchagin.
2016-09-25 11:25:31 +02:00
Lennart Poettering 6757c06a1a man: shorten the exit status table a bit
Let's merge a couple of columns, to make the table a bit shorter. This
effectively just drops whitespace, not contents, but makes the currently
humungous table much much more compact.
2016-09-25 10:52:57 +02:00
Lennart Poettering 81c8aceed4 man: the exit code/signal is stored in $EXIT_CODE, not $EXIT_STATUS 2016-09-25 10:52:57 +02:00
Lennart Poettering effbd6d2ea man: rework documentation for ReadOnlyPaths= and related settings
This reworks the documentation for ReadOnlyPaths=, ReadWritePaths=,
InaccessiblePaths=. It no longer claims that we'd follow symlinks relative to
the host file system. (Which wasn't true actually, as we didn't follow symlinks
at all in the most recent releases, and we know do follow them, but relative to
RootDirectory=).

This also replaces all references to the fact that all fs namespacing options
can be undone with enough privileges and disable propagation by a single one in
the documentation of ReadOnlyPaths= and friends, and then directs the read to
this in all other places.

Moreover a hint is added to the documentation of SystemCallFilter=, suggesting
usage of ~@mount in case any of the fs namespacing related options are used.
2016-09-25 10:42:18 +02:00
Lennart Poettering b2656f1b1c man: in user-facing documentaiton don't reference C function names
Let's drop the reference to the cap_from_name() function in the documentation
for the capabilities setting, as it is hardly helpful. Our readers are not
necessarily C hackers knowing the semantics of cap_from_name(). Moreover, the
strings we accept are just the plain capability names as listed in
capabilities(7) hence there's really no point in confusing the user with
anything else.
2016-09-25 10:42:18 +02:00
Lennart Poettering 63bb64a056 core: imply ProtectHome=read-only and ProtectSystem=strict if DynamicUser=1
Let's make sure that services that use DynamicUser=1 cannot leave files in the
file system should the system accidentally have a world-writable directory
somewhere.

This effectively ensures that directories need to be whitelisted rather than
blacklisted for access when DynamicUser=1 is set.
2016-09-25 10:42:18 +02:00
Lennart Poettering 3f815163ff core: introduce ProtectSystem=strict
Let's tighten our sandbox a bit more: with this change ProtectSystem= gains a
new setting "strict". If set, the entire directory tree of the system is
mounted read-only, but the API file systems /proc, /dev, /sys are excluded
(they may be managed with PrivateDevices= and ProtectKernelTunables=). Also,
/home and /root are excluded as those are left for ProtectHome= to manage.

In this mode, all "real" file systems (i.e. non-API file systems) are mounted
read-only, and specific directories may only be excluded via
ReadWriteDirectories=, thus implementing an effective whitelist instead of
blacklist of writable directories.

While we are at, also add /efi to the list of paths always affected by
ProtectSystem=. This is a follow-up for
b52a109ad3 which added /efi as alternative for
/boot. Our namespacing logic should respect that too.
2016-09-25 10:42:18 +02:00
Lennart Poettering 59eeb84ba6 core: add two new service settings ProtectKernelTunables= and ProtectControlGroups=
If enabled, these will block write access to /sys, /proc/sys and
/proc/sys/fs/cgroup.
2016-09-25 10:18:48 +02:00