Commit graph

139 commits

Author SHA1 Message Date
Lennart Poettering 2d3a5a73e0 nspawn: make sure images containing an ESP are compatible with userns -U mode
In -U mode we might need to re-chown() all files and directories to
match the UID shift we want for the image. That's problematic on fat
partitions, such as the ESP (and which is generated by mkosi's
--bootable switch), because fat of course knows no UID/GID file
ownership natively.

With this change we take benefit of the uid= and gid= mount options FAT
knows: instead of chown()ing all files and directories we can just
specify the right UID/GID to use at mount time.

This beefs up the image dissection logic in two ways:

1. First of all support for mounting relevant file systems with
   uid=/gid= is added: when a UID is specified during mount it is used for
   all applicable file systems.

2. Secondly, two new mount flags are added:
   DISSECT_IMAGE_MOUNT_ROOT_ONLY and DISSECT_IMAGE_MOUNT_NON_ROOT_ONLY.
   If one is specified the mount routine will either only mount the root
   partition of an image, or all partitions except the root partition.
   This is used by nspawn: first the root partition is mounted, so that
   we can determine the UID shift in use so far, based on ownership of
   the image's root directory. Then, we mount the remaining partitions
   in a second go, this time with the right UID/GID information.
2017-12-05 13:49:12 +01:00
Shawn Landden 4831981d89 tree-wide: adjust fall through comments so that gcc is happy
Distcc removes comments, making the comment silencing
not work.

I know there was a decision against a macro in commit
ec251fe7d5
2017-11-20 13:06:25 -08:00
Zbigniew Jędrzejewski-Szmek 53e1b68390 Add SPDX license identifiers to source files under the LGPL
This follows what the kernel is doing, c.f.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5fd54ace4721fc5ce2bb5aef6318fcf17f421460.
2017-11-19 19:08:15 +01:00
Lennart Poettering 4e0c20de97 namespace: set up OS hierarchy only after mounting the new root, not before
Otherwise it's a pointless excercise, as we'll set up an empty directory
tree that's never going to be used.

Hence, let's move this around a bit, so that we do the basesystem
initialization exactly when RootImage= or RootDirectory= are used, but
not otherwise.
2017-11-13 10:22:36 +01:00
Yu Watanabe d18aff0422 core: ReadWritePaths= and friends assume '+' prefix when BindPaths= or freinds are set
When at least one of BindPaths=, BindReadOnlyPaths=, RootImage=,
RuntimeDirectory= or their friends are set, systemd prepares
a namespace under /run/systemd/unit-root. Thus, ReadWritePaths=
or their friends without '+' prefix is completely meaningless.
So, let's assume '+' prefix when one of them are set.

Fixes #7070 and #7080.
2017-11-08 15:48:01 +09:00
Lennart Poettering 0fa5b8312a namespace: make ns_type_supported() a tiny bit shorter
namespace_type_to_string() already validates the type paramater, we can
use that, and shorten the function a bit.
2017-10-10 09:52:08 +02:00
Lennart Poettering bb0ff3fb1b namespace: change NameSpace → Namespace
We generally use the casing "Namespace" for the word, and that's visible
in a number of user-facing interfaces, including "RestrictNamespace=" or
"JoinsNamespaceOf=". Let's make sure to use the same casing internally
too.

As discussed in #7024
2017-10-10 09:51:58 +02:00
Michal Sekletar 6e2d7c4f13 namespace: fall back gracefully when kernel doesn't support network namespaces (#7024) 2017-10-10 09:46:13 +02:00
Zbigniew Jędrzejewski-Szmek 349cc4a507 build-sys: use #if Y instead of #ifdef Y everywhere
The advantage is that is the name is mispellt, cpp will warn us.

$ git grep -Ee "conf.set\('(HAVE|ENABLE)_" -l|xargs sed -r -i "s/conf.set\('(HAVE|ENABLE)_/conf.set10('\1_/"
$ git grep -Ee '#ifn?def (HAVE|ENABLE)' -l|xargs sed -r -i 's/#ifdef (HAVE|ENABLE)/#if \1/; s/#ifndef (HAVE|ENABLE)/#if ! \1/;'
$ git grep -Ee 'if.*defined\(HAVE' -l|xargs sed -i -r 's/defined\((HAVE_[A-Z0-9_]*)\)/\1/g'
$ git grep -Ee 'if.*defined\(ENABLE' -l|xargs sed -i -r 's/defined\((ENABLE_[A-Z0-9_]*)\)/\1/g'
+ manual changes to meson.build

squash! build-sys: use #if Y instead of #ifdef Y everywhere

v2:
- fix incorrect setting of HAVE_LIBIDN2
2017-10-04 12:09:29 +02:00
Lennart Poettering 6c47cd7d3b execute: make StateDirectory= and friends compatible with DynamicUser=1 and RootDirectory=/RootImage=
Let's clean up the interaction of StateDirectory= (and friends) to
DynamicUser=1: instead of creating these directories directly below
/var/lib, place them in /var/lib/private instead if DynamicUser=1 is
set, making that directory 0700 and owned by root:root. This way, if a
dynamic UID is later reused, access to the old run's state directory is
prohibited for that user. Then, use file system namespacing inside the
service to make /var/lib/private a readable tmpfs, hiding all state
directories that are not listed in StateDirectory=, and making access to
the actual state directory possible. Mount all directories listed in
StateDirectory= to the same places inside the service (which means
they'll now be mounted into the tmpfs instance). Finally, add a symlink
from the state directory name in /var/lib/ to the one in
/var/lib/private, so that both the host and the service can access the
path under the same location.

Here's an example: let's say a service runs with StateDirectory=foo.
When DynamicUser=0 is set, it will get the following setup, and no
difference between what the unit and what the host sees:

        /var/lib/foo (created as directory)

Now, if DynamicUser=1 is set, we'll instead get this on the host:

        /var/lib/private (created as directory with mode 0700, root:root)
        /var/lib/private/foo (created as directory)
        /var/lib/foo → private/foo (created as symlink)

And from inside the unit:

        /var/lib/private (a tmpfs mount with mode 0755, root:root)
        /var/lib/private/foo (bind mounted from the host)
        /var/lib/foo → private/foo (the same symlink as above)

This takes inspiration from how container trees are protected below
/var/lib/machines: they generally reuse UIDs/GIDs of the host, but
because /var/lib/machines itself is set to 0700 host users cannot access
files in the container tree even if the UIDs/GIDs are reused. However,
for this commit we add one further trick: inside and outside of the unit
/var/lib/private is a different thing: outside it is a plain,
inaccessible directory, and inside it is a world-readable tmpfs mount
with only the whitelisted subdirs below it, bind mounte din.  This
means, from the outside the dir acts as an access barrier, but from the
inside it does not. And the symlink created in /var/lib/foo itself
points across the barrier in both cases, so that root and the unit's
user always have access to these dirs without knowing the details of
this mounting magic.

This logic resolves a major shortcoming of DynamicUser=1 units:
previously they couldn't safely store persistant data. With this change
they can have their own private state, log and data directories, which
they can write to, but which are protected from UID recycling.

With this change, if RootDirectory= or RootImage= are used it is ensured
that the specified state/log/cache directories are always mounted in
from the host. This change of semantics I think is much preferable since
this means the root directory/image logic can be used easily for
read-only resource bundling (as all writable data resides outside of the
image). Note that this is a change of behaviour, but given that we
haven't released any systemd version with StateDirectory= and friends
implemented this should be a safe change to make (in particular as
previously it wasn't clear what would actually happen when used in
combination). Moreover, by making this change we can later add a "+"
modifier to these setings too working similar to the same modifier in
ReadOnlyPaths= and friends, making specified paths relative to the
container itself.
2017-10-02 17:41:44 +02:00
Lennart Poettering a227a4be48 namespace: if we can create the destination of bind and PrivateTmp= mounts
When putting together the namespace, always create the file or directory
we are supposed to bind mount on, the same way we do it for most other
stuff, for example mount units or systemd-nspawn's --bind= option.

This has the big benefit that we can use namespace bind mounts on dirs
in /tmp or /var/tmp even in conjunction with PrivateTmp=.
2017-10-02 17:41:43 +02:00
Lennart Poettering e908468b5b namespace: properly handle bind mounts from the host
Before this patch we had an ordering problem: if we have no namespacing
enabled except for two bind mounts that intend to swap /a and /b via
bind mounts, then we'd execute the bind mount binding /b to /a, followed
by thebind mount from /a to /b, thus having the effect that /b is now
visible in both /a and /b, which was not intended.

With this change, as soon as any bind mount is configured we'll put
together the service mount namespace in a temporary directory instead of
operating directly in the root. This solves the problem in a
straightforward fashion: the source of bind mounts will always refer to
the host, and thus be unaffected from the bind mounts we already
created.
2017-10-02 17:41:43 +02:00
Lennart Poettering 645767d6b5 namespace: create /dev, /proc, /sys when needed
We already create /dev implicitly if PrivateTmp=yes is on, if it is
missing. Do so too for the other two API VFS, as well as for /dev if
PrivateTmp=yes is off but MountAPIVFS=yes is on (i.e. when /dev is bind
mounted from the host).
2017-10-02 17:41:43 +02:00
Topi Miettinen 07ce74074d namespace: avoid assertion failure (#6649)
If the root image is not decrypted, it must not be relinquished.
2017-08-29 17:31:24 +02:00
Nicolas Iooss 3a0bf6d6aa namespace: keep selinuxfs mounted read-write with ProtectKernelTunables (#5741)
When a service unit uses "ProtectKernelTunables=yes", it currently
remounts /sys/fs/selinux read-only. This makes libselinux report SELinux
state as "disabled", because most SELinux features are not usable. For
example it is not possible to validate security contexts (with
security_check_context_raw() or /sys/fs/selinux/context). This behavior
of libselinux has been described in
http://danwalsh.livejournal.com/73099.html and confirmed in a recent
email, https://marc.info/?l=selinux&m=149220233032594&w=2 .

Since commit 0c28d51ac8 ("units: further lock down our long-running
services"), systemd-localed unit uses ProtectKernelTunables=yes.
Nevertheless this service needs to use libselinux API in order to create
/etc/vconsole.conf, /etc/locale.conf... with the right SELinux contexts.
This is broken when /sys/fs/selinux is mounted read-only in the mount
namespace of the service.

Make SELinux-aware systemd services work again when they are using
ProtectKernelTunables=yes by keeping selinuxfs mounted read-write.
2017-07-31 17:45:33 +02:00
Timothée Ravier ac9de0b379 core: open /proc/self/mountinfo early to allow mounts over /proc (#5985)
Enable masking the /proc folder using the 'InaccessiblePaths' unit
option.

This also slightly simplify mounts setup as the bind_remount_recursive
function will only open /proc/self/mountinfo once.

This is based on the suggestion at:
https://lists.freedesktop.org/archives/systemd-devel/2017-April/038634.html
2017-05-19 14:38:40 +02:00
Djalal Harouni 9c988f934b namespace: Apply MountAPIVFS= only when a Root directory is set
The MountAPIVFS= documentation says that this options has no effect
unless used in conjunction with RootDirectory= or RootImage= ,lets fix
this and avoid to create private mount namespaces where it is not
needed.
2017-03-05 21:39:43 +01:00
Djalal Harouni 10404d52e3 namespace: create base-filesystem directories if RootImage= or RootDirectory= are set
When a service is started with its own file system image, always try to
create the base-filesystem directories that are needed. This implicitly
covers the directories handled by MountAPIVFS= {/proc|/sys|/dev}.

Mount protections or MountAPIVFS= mounts were never applied if we
changed the root directory and the related paths were not present under
the new root. The mounts were silently. Fix this by creating those
directories if they are missing.

Closes https://github.com/systemd/systemd/issues/5488
2017-03-05 21:19:29 +01:00
AsciiWolf 13e785f7a0 Fix missing space in comments (#5439) 2017-02-24 18:14:02 +01:00
Lennart Poettering 78ebe98061 core,nspawn,dissect: make nspawn's .roothash file search reusable
This makes nspawn's logic of automatically discovering the root hash of
an image file generic, and then reuses it in systemd-dissect and in
PID1's RootImage= logic, so that verity is automatically set up whenever
we can.
2017-02-07 12:21:28 +01:00
Lennart Poettering 915e6d1676 core: add RootImage= setting for using a specific image file as root directory for a service
This is similar to RootDirectory= but mounts the root file system from a
block device or loopback file instead of another directory.

This reuses the image dissector code now used by nspawn and
gpt-auto-discovery.
2017-02-07 12:19:42 +01:00
Lennart Poettering 5d997827e2 core: add a per-unit setting MountAPIVFS= for mounting /dev, /proc, /sys in conjunction with RootDirectory=
This adds a boolean unit file setting MountAPIVFS=. If set, the three
main API VFS mounts will be mounted for the service. This only has an
effect on RootDirectory=, which it makes a ton times more useful.

(This is basically the /dev + /proc + /sys mounting code posted in the
original #4727, but rebased on current git, and with the automatic logic
replaced by explicit logic controlled by a unit file setting)
2017-02-07 11:22:05 +01:00
Lennart Poettering 1eb7e08e20 core: fix minor memleak in namespace.c
The source_malloc field wants to be freed, too.
2017-02-07 11:22:05 +01:00
Lennart Poettering d2d6c096f6 core: add ability to define arbitrary bind mounts for services
This adds two new settings BindPaths= and BindReadOnlyPaths=. They allow
defining arbitrary bind mounts specific to particular services. This is
particularly useful for services with RootDirectory= set as this permits making
specific bits of the host directory available to chrooted services.

The two new settings follow the concepts nspawn already possess in --bind= and
--bind-ro=, as well as the .nspawn settings Bind= and BindReadOnly= (and these
latter options should probably be renamed to BindPaths= and BindReadOnlyPaths=
too).

Fixes: #3439
2016-12-14 00:54:10 +01:00
Lennart Poettering 8fceda937f namespace: instead of chasing mount symlinks a priori, do so as-we-go
This is relevant as many of the mounts we try to establish only can be followed
when some other prior mount that is a prefix of it is established. Hence: move
the symlink chasing into the actual mount functions, so that we do it as late
as possibly but as early as necessary.

Fixes: #4588
2016-12-14 00:51:37 +01:00
Lennart Poettering 34de407a4f core: rename BindMount structure → MountEntry
After all, these don#t strictly encapsulate bind mounts anymore, and we are
preparing this for adding arbitrary user-defined bind mounts in a later commit,
at which point this would become really confusing. Let's clean this up, rename
the BindMount structure to MountEntry, so that it is clear that it can contain
information about any kind of mount.
2016-12-14 00:48:52 +01:00
Lennart Poettering cfbeb4ef8d namespace: add explicit read-only flag
This reworks handling of the read-only management for mount points. This will
become handy as soon as we add arbitrary bind mount support (which comes in a
later commit).
2016-12-14 00:42:01 +01:00
Lennart Poettering ddbe041277 namespace: reindent protect_system_strict_table[] as well
All other tables got reindented, but one was forgotten. Fix that.
2016-12-13 21:22:13 +01:00
Lennart Poettering c4f4fce79e fs-util: add flags parameter to chase_symlinks()
Let's remove chase_symlinks_prefix() and instead introduce a flags parameter to
chase_symlinks(), with a flag CHASE_PREFIX_ROOT that exposes the behaviour of
chase_symlinks_prefix().
2016-12-01 00:25:51 +01:00
Lennart Poettering e187369587 tree-wide: stop using canonicalize_file_name(), use chase_symlinks() instead
Let's use chase_symlinks() everywhere, and stop using GNU
canonicalize_file_name() everywhere. For most cases this should not change
behaviour, however increase exposure of our function to get better tested. Most
importantly in a few cases (most notably nspawn) it can take the correct root
directory into account when chasing symlinks.
2016-12-01 00:25:51 +01:00
Lennart Poettering aa70f38b5c namespace: clarify that /proc/apm is obsolete, but leave it blocked 2016-11-17 18:10:30 +01:00
Lennart Poettering c6232fb0e9 namespace: reindent namespace tables
Let's align all our BindMount tables, let's use the same column widths in all
of them, and let's make them not any wider than necessary.

This only changes whitespace, not contents of any of the tables.
2016-11-17 18:09:16 +01:00
Lennart Poettering 5327c910d2 namespace: simplify, optimize and extend handling of mounts for namespace
This changes a couple of things in the namespace handling:

It merges the BindMount and TargetMount structures. They are mostly the same,
hence let's just use the same structue, and rely on C's implicit zero
initialization of partially initialized structures for the unneeded fields.

This reworks memory management of each entry a bit. It now contains one "const"
and one "malloc" path. We use the former whenever we can, but use the latter
when we have to, which is the case when we have to chase symlinks or prefix a
root directory. This means in the common case we don't actually need to
allocate any dynamic memory. To make this easy to use we add an accessor
function bind_mount_path() which retrieves the right path string from a
BindMount structure.

While we are at it, also permit "+" as prefix for dirs configured with
ReadOnlyPaths= and friends: if specified the root directory of the unit is
implicited prefixed.

This also drops set_bind_mount() and uses C99 structure initialization instead,
which I think is more readable and clarifies what is being done.

This drops append_protect_kernel_tunables() and
append_protect_kernel_modules() as append_static_mounts() is now simple enough
to be called directly.

Prefixing with the root dir is now done in an explicit step in
prefix_where_needed(). It will prepend the root directory on each entry that
doesn't have it prefixed yet. The latter is determined depending on an extra
bit in the BindMount structure.
2016-11-17 18:08:32 +01:00
Djalal Harouni 1d54cd5d25 core:namespace: count and free failed paths inside chase_all_symlinks() (#4619)
This certainly fixes a bug that was introduced by PR
https://github.com/systemd/systemd/pull/4594 that intended to fix
https://github.com/systemd/systemd/issues/4567.

The fix was not complete. This patch makes sure that we count and free
all paths that fail inside chase_all_symlinks().

Fixes https://github.com/systemd/systemd/issues/4567
2016-11-10 12:11:37 -05:00
Djalal Harouni af964954c6 core: on DynamicUser= make sure that protecting sensitive paths is enforced (#4596)
This adds a variable that is always set to false to make sure that
protect paths inside sandbox are always enforced and not ignored. The only
case when it is set to true is on DynamicUser=no and RootDirectory=/chroot
is set. This allows users to use more our sandbox features inside RootDirectory=

The only exception is ProtectSystem=full|strict and when DynamicUser=yes
is implied. Currently RootDirectory= is not fully compatible with these
due to two reasons:

* /chroot/usr|etc has to be present on ProtectSystem=full
* /chroot// has to be a mount point on ProtectSystem=strict.
2016-11-08 21:57:32 -05:00
Zbigniew Jędrzejewski-Szmek 46c3230dd0 nspawn: slight simplification 2016-11-07 08:57:30 -05:00
Zbigniew Jędrzejewski-Szmek 49fedb4094 nspawn: avoid one strdup by using free_and_replace 2016-11-07 08:54:47 -05:00
Djalal Harouni f0a4feb0a5 core: make RootDirectory= and ProtectKernelModules= work
Instead of having two fields inside BindMount struct where one is stack
based and the other one is heap, use one field to store the full path
and updated it when we chase symlinks. This way we avoid dealing with
both at the same time.

This makes RootDirectory= work with ProtectHome= and ProtectKernelModules=yes

Fixes: https://github.com/systemd/systemd/issues/4567
2016-11-07 12:34:52 +01:00
Zbigniew Jędrzejewski-Szmek 605405c6cc tree-wide: drop NULL sentinel from strjoin
This makes strjoin and strjoina more similar and avoids the useless final
argument.

spatch -I . -I ./src -I ./src/basic -I ./src/basic -I ./src/shared -I ./src/shared -I ./src/network -I ./src/locale -I ./src/login -I ./src/journal -I ./src/journal -I ./src/timedate -I ./src/timesync -I ./src/nspawn -I ./src/resolve -I ./src/resolve -I ./src/systemd -I ./src/core -I ./src/core -I ./src/libudev -I ./src/udev -I ./src/udev/net -I ./src/udev -I ./src/libsystemd/sd-bus -I ./src/libsystemd/sd-event -I ./src/libsystemd/sd-login -I ./src/libsystemd/sd-netlink -I ./src/libsystemd/sd-network -I ./src/libsystemd/sd-hwdb -I ./src/libsystemd/sd-device -I ./src/libsystemd/sd-id128 -I ./src/libsystemd-network --sp-file coccinelle/strjoin.cocci --in-place $(git ls-files src/*.c)

git grep -e '\bstrjoin\b.*NULL' -l|xargs sed -i -r 's/strjoin\((.*), NULL\)/strjoin(\1)/'

This might have missed a few cases (spatch has a really hard time dealing
with _cleanup_ macros), but that's no big issue, they can always be fixed
later.
2016-10-23 11:43:27 -04:00
Djalal Harouni c575770b75 core:sandbox: lets make /lib/modules/ inaccessible on ProtectKernelModules=
Lets go further and make /lib/modules/ inaccessible for services that do
not have business with modules, this is a minor improvment but it may
help on setups with custom modules and they are limited... in regard of
kernel auto-load feature.

This change introduce NameSpaceInfo struct which we may embed later
inside ExecContext but for now lets just reduce the argument number to
setup_namespace() and merge ProtectKernelModules feature.
2016-10-12 14:11:16 +02:00
Djalal Harouni b6c432ca7e core:namespace: simplify ProtectHome= implementation
As with previous patch simplify ProtectHome and don't care about
duplicates, they will be sorted by most restrictive mode and cleaned.
2016-09-25 12:41:16 +02:00
Djalal Harouni f471b2afa1 core: simplify ProtectSystem= implementation
ProtectSystem= with all its different modes and other options like
PrivateDevices= + ProtectKernelTunables= + ProtectHome= are orthogonal,
however currently it's a bit hard to parse that from the implementation
view. Simplify it by giving each mode its own table with all paths and
references to other Protect options.

With this change some entries are duplicated, but we do not care since
duplicate mounts are first sorted by the most restrictive mode then
cleaned.
2016-09-25 12:21:25 +02:00
Djalal Harouni 49accde7bd core:sandbox: add more /proc/* entries to ProtectKernelTunables=
Make ALSA entries, latency interface, mtrr, apm/acpi, suspend interface,
filesystems configuration and IRQ tuning readonly.

Most of these interfaces now days should be in /sys but they are still
available through /proc, so just protect them. This patch does not touch
/proc/net/...
2016-09-25 11:30:11 +02:00
Djalal Harouni 2652c6c103 core:namespace: simplify mount calculation
Move out mount calculation on its own function. Actually the logic is
smart enough to later drop nop and duplicates mounts, this change
improves code readability.
---
 src/core/namespace.c | 47 ++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 36 insertions(+), 11 deletions(-)
2016-09-25 11:25:00 +02:00
Djalal Harouni 11a30cec2a core:namespace: put paths protected by ProtectKernelTunables= in
Instead of having all these paths everywhere, put the ones that are
protected by ProtectKernelTunables= into their own table. This way it
is easy to add paths and track which ones are protected.
2016-09-25 11:16:44 +02:00
Djalal Harouni 9c94d52e09 core:namespace: minor improvements to append_mounts() 2016-09-25 11:03:21 +02:00
Lennart Poettering cd2902c954 namespace: drop all mounts outside of the new root directory
There's no point in mounting these, if they are outside of the root directory
we'll move to.
2016-09-25 10:52:57 +02:00
Lennart Poettering 8f1ad200f0 namespace: don't make the root directory of a namespace a mount if it already is one
Let's not stack mounts needlessly.
2016-09-25 10:42:18 +02:00
Lennart Poettering d944dc9553 namespace: chase symlinks for mounts to set up in userspace
This adds logic to chase symlinks for all mount points that shall be created in
a namespace environment in userspace, instead of leaving this to the kernel.
This has the advantage that we can correctly handle absolute symlinks that
shall be taken relative to a specific root directory. Moreover, we can properly
handle mounts created on symlinked files or directories as we can merge their
mounts as necessary.

(This also drops the "done" flag in the namespace logic, which was never
actually working, but was supposed to permit a partial rollback of the
namespace logic, which however is only mildly useful as it wasn't clear in
which case it would or would not be able to roll back.)

Fixes: #3867
2016-09-25 10:42:18 +02:00
Lennart Poettering 1e4e94c881 namespace: invoke unshare() only after checking all parameters
Let's create the new namespace only after we validated and processed all
parameters, right before we start with actually mounting things.

This way, the window where we can roll back is larger (not that it matters
IRL...)
2016-09-25 10:42:18 +02:00