170 commits

Author SHA1 Message Date
Lennart Poettering c9ef8573be namespace: don't consider raw image read-only if /home in it is writable 2018-04-18 14:15:48 +02:00
Lennart Poettering 12777909c9
Merge pull request #8417 from brauner/2018-03-09/add_bind_mount_fallback_to_private_devices
core: fall back to bind-mounts for PrivateDevices= execution environments
2018-04-18 11:56:56 +02:00
Zbigniew Jędrzejewski-Szmek af984e137e core/namespace: rework the return semantics of clone_device_node yet again
Returning 0 on not-found/wrong-type is confusing. Let's return -ENXIO in that
case instead, and explicitly ignore it in the call site where we want to do that.
I think this is clearer and less likely to be used erroneously in case another
call site is added.

C.f. 152c475f95 and 98b1d2b8d9.
2018-04-12 18:15:33 +02:00
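The return convention described in this commit can be sketched roughly as follows. This is an illustration only, not the code from core/namespace.c: `clone_node_sketch()` and `setup_optional_node()` are hypothetical names, and the actual node creation is elided.

```c
#include <assert.h>
#include <errno.h>
#include <sys/stat.h>

/* Hypothetical stand-in for clone_device_node(): returns 0 on success,
 * -ENXIO if the source is missing or is not a device node, and another
 * negative errno value on any other failure. */
int clone_node_sketch(const char *source) {
        struct stat st;

        assert(source);

        if (stat(source, &st) < 0)
                return errno == ENOENT ? -ENXIO : -errno;

        if (!S_ISCHR(st.st_mode) && !S_ISBLK(st.st_mode))
                return -ENXIO;

        /* The actual mknod() call would go here. */
        return 0;
}

/* A call site that considers a missing node acceptable ignores -ENXIO
 * explicitly, so the decision is visible where it is made. */
int setup_optional_node(const char *source) {
        int r;

        r = clone_node_sketch(source);
        if (r < 0 && r != -ENXIO)
                return r;

        return 0;
}
```

A call site that treats the node as essential would simply propagate every negative return value, including -ENXIO.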
Christian Brauner 1649861744 core: fall back to bind-mounts for PrivateDevices= execution environments
In environments where CAP_MKNOD is not available or inside
user namespaces it is still desirable to enable services to use
PrivateDevices= . So fall back to using bind-mounts on EPERM.
2018-04-12 18:15:12 +02:00
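A minimal sketch of the fallback idea, assuming Linux and a target path that does not exist yet. The function name `device_node_with_fallback()` is hypothetical and the real implementation in systemd may differ in structure and error handling; only the "try mknod(), fall back to a bind mount on EPERM" logic is taken from the commit message.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: make the device node "source" available at "target".
 * Returns 0 on success or a negative errno value. */
int device_node_with_fallback(const char *source, const char *target) {
        struct stat st;
        int fd;

        assert(source);
        assert(target);

        if (stat(source, &st) < 0)
                return -errno;

        if (!S_ISCHR(st.st_mode) && !S_ISBLK(st.st_mode))
                return -ENXIO;

        /* Preferred path: create a real device node. */
        if (mknod(target, st.st_mode, st.st_rdev) >= 0)
                return 0;
        if (errno != EPERM)
                return -errno;

        /* Fallback (no CAP_MKNOD, or inside a user namespace): create an
         * empty file as mount point and bind mount the existing node on it. */
        fd = open(target, O_WRONLY | O_CREAT | O_CLOEXEC, 0644);
        if (fd < 0)
                return -errno;
        close(fd);

        if (mount(source, target, NULL, MS_BIND, NULL) < 0)
                return -errno;

        return 0;
}
```

The bind mount itself still needs the privilege to mount inside the namespace being set up, so the fallback relaxes the CAP_MKNOD requirement only.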
Zbigniew Jędrzejewski-Szmek 11a1589223 tree-wide: drop license boilerplate
Files which are installed as-is (any .service and other unit files, .conf
files, .policy files, etc), are left as is. My assumption is that SPDX
identifiers are not yet that well known, so it's better to retain the
extended header to avoid any doubt.

I also kept any copyright lines. We can probably remove them, but it'd be nice to
obtain explicit acks from all involved authors before doing that.
2018-04-06 18:58:55 +02:00
Yu Watanabe 1cc6c93a95 tree-wide: use TAKE_PTR() and TAKE_FD() macros 2018-04-05 14:26:26 +09:00
Lennart Poettering 62570f6f03 fs-util: add new CHASE_TRAIL_SLASH flag for chase_symlinks()
This rearranges chase_symlinks() a bit: if no special flags are
specified it will now revert to behaviour before
b12d25a8d6. However, if the new
CHASE_TRAIL_SLASH flag is specified it will follow the behaviour
introduced by that commit.

I wasn't sure which one to make the behaviour that requires specification
of a flag to enable. I opted to make the "append trailing slash"
behaviour the one to enable by a flag, following the thinking that the
function should primarily be used to generate a normalized path, and I
am pretty sure a path without trailing slash is the more "normalized"
one, as the trailing slash is not really a part of it, but merely a
"decorator" that tells various system calls to generate ENOTDIR if the
path doesn't refer to a directory.

Or to say this differently: if the slash was part of normalization then
we really should add it in all cases when the final path is a directory,
not just when the user originally specified it.

Fixes: #8544
Replaces: #8545
2018-03-22 19:54:24 +01:00
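The flag semantics discussed in this commit can be illustrated with a purely string-based sketch. It does not resolve symlinks and is not chase_symlinks(); `normalize_sketch()` and `SKETCH_TRAIL_SLASH` are hypothetical names modelled on the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical flag, modelled on CHASE_TRAIL_SLASH. */
#define SKETCH_TRAIL_SLASH (1U << 0)

/* Returns a newly allocated copy of "path" with trailing slashes removed.
 * If SKETCH_TRAIL_SLASH is set and the caller's path ended in a slash, a
 * single trailing slash is appended again. Returns NULL on allocation
 * failure. The caller frees the result. */
char *normalize_sketch(const char *path, unsigned flags) {
        size_t n;
        bool had_slash;
        char *r;

        assert(path);

        n = strlen(path);
        had_slash = n > 1 && path[n - 1] == '/';

        /* Strip trailing slashes, but keep a lone "/" intact. */
        while (n > 1 && path[n - 1] == '/')
                n--;

        r = malloc(n + 2);
        if (!r)
                return NULL;

        memcpy(r, path, n);
        if ((flags & SKETCH_TRAIL_SLASH) && had_slash && path[n - 1] != '/')
                r[n++] = '/';
        r[n] = '\0';

        return r;
}
```

Without the flag the result is the normalized form. With the flag, the caller's "this must be a directory" decoration is carried over to the result.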
Zbigniew Jędrzejewski-Szmek 671f0f8de0 Remove /sbin from paths if split-bin is false (#8324)
Follow-up for 157baa87e4.
2018-03-01 21:48:36 +01:00
Ansgar Burchardt 7486f305cd Include additional directories in ProtectSystem 2018-02-27 18:56:19 -03:00
Zbigniew Jędrzejewski-Szmek aa484f3561 tree-wide: use reallocarray instead of our home-grown realloc_multiply (#8279)
There isn't much difference, but in general we prefer to use the standard
functions. glibc provides reallocarray since version 2.26.

I moved the explicit_bzero configure test to the bottom, so that the two stdlib
functions are at the bottom.
2018-02-26 21:20:00 +01:00
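For context, the point of both helpers is an overflow-checked multiplication before reallocating. The sketch below shows that check in isolation. `realloc_multiply_sketch()` is a hypothetical name and not necessarily how the removed helper was written; glibc's reallocarray(ptr, nmemb, size) performs an equivalent check itself and fails with ENOMEM on overflow.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical home-grown helper: resize "p" to hold "need" elements of
 * "size" bytes each, refusing requests whose byte count would overflow
 * size_t instead of silently wrapping around. */
void *realloc_multiply_sketch(void *p, size_t need, size_t size) {
        if (size != 0 && need > SIZE_MAX / size) {
                errno = ENOMEM;
                return NULL;
        }

        return realloc(p, need * size);
}
```

On glibc 2.26 or newer, a call such as `realloc_multiply_sketch(p, n, sizeof(int))` could be written as `reallocarray(p, n, sizeof(int))` instead.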
Lennart Poettering 13a141f046 namespace: protect bpf file system as part of ProtectKernelTunables=
It also exposes kernel objects, let's better include this in
ProtectKernelTunables=.
2018-02-21 16:43:36 +01:00
Yu Watanabe e4da7d8c79 core: add new option 'tmpfs' to ProtectHome=
This makes the ProtectHome= setting accept 'tmpfs'. This is mostly
equivalent to `TemporaryFileSystem=/home /run/user /root`.
2018-02-21 09:18:17 +09:00
Yu Watanabe 2abd4e388a core: add new setting TemporaryFileSystem=
This introduces a new setting TemporaryFileSystem=. This is useful
to hide files not relevant to the processes invoked by unit, while
necessary files or directories can be still accessed by combining
with Bind{,ReadOnly}Paths=.
2018-02-21 09:17:52 +09:00
Yu Watanabe 4ca763a902 core/namespace: make '-' prefix in Bind{,ReadOnly}Paths= work
Each path in `Bind{,ReadOnly}Paths=` accepts a '-' prefix. However,
the prefix is completely ignored.
This makes it work as expected.
2018-02-21 09:07:56 +09:00
Yu Watanabe f5c52a7724 core/namespace: remove unused argument 2018-02-21 09:05:30 +09:00
Yu Watanabe e282f51f57 core/namespace: use free_and_replace() 2018-02-21 09:05:21 +09:00
Yu Watanabe 55fe743273 core/namespace: fix comment 2018-02-21 09:05:18 +09:00
Yu Watanabe 89bd586cd3 core/namespace: merge PRIVATE_VAR_TMP into PRIVATE_TMP 2018-02-21 09:05:16 +09:00
Yu Watanabe 2a2969fd5d core/namespace: make arguments const if possible 2018-02-21 09:05:14 +09:00
Zbigniew Jędrzejewski-Szmek f863b1c6fa core: move very long argument to a separate statement
I like compact, but this was a bit too much.
2018-02-15 10:10:01 +01:00
Lennart Poettering 152c475f95 namespace: fix error handling when clone_device_node() returns 0
Before this patch, we'd treat clone_device_node() returning 0 (as
opposed to 1) as error, but then propagate this non-error result in
confusion.

This makes sure that if ptmx isn't around we propagate that as
-ENXIO.

This is a follow-up for 98b1d2b8d9
2018-01-23 19:50:32 +01:00
Lennart Poettering 36ce7110b0 namespace: use is_symlink() helper
We have this pretty little helper, let's use it, it makes things a tiny
bit more readable.
2018-01-23 19:36:55 +01:00
Lennart Poettering 6f7f3a3351 namespace: use stack allocation for paths, where we can 2018-01-23 19:36:36 +01:00
Alan Jenkins 68f7480b7e
Merge pull request #7913 from sourcejedi/devpts
3 nitpicks from core/namespace.c
2018-01-18 21:56:26 +00:00
Alan Jenkins 225874dc9c core: clone_device_node(): add debug message
For people who use debug messages, maybe it is helpful to know that
PrivateDevices= failed due to mknod(), and which device node.

(The other (un-logged) failures could be while mounting filesystems e.g. no
CAP_SYS_ADMIN which is the common case, or missing /dev/shm or /dev/pts,
or missing /dev/ptmx).
2018-01-18 13:58:13 +00:00
Alan Jenkins 8d95368210 core: namespace: remove unnecessary mode on /dev/shm mount target
This should have no behavioural effect; it just confused me.

All the other mount directories in this function are created as 0755.
Some of the mounts are allowed to fail - mqueue and hugepages.
If the /dev/mqueue mount target was created with the permissive mode 01777,
to match the filesystem we're trying to mount there, then a mount failure
would allow unprivileged users to write to the /dev filesystem, e.g. to
exhaust the available space.  There is no reason to allow this.

(Allowing the user read access (0755) seems a reasonable idea though, e.g. for
quicker troubleshooting.)

We do not allow failure of the /dev/shm mount, so it doesn't matter that
it is created as 01777.  But on the same grounds, we have no *reason* to
create it as any specific mode.  0755 is equally fine.

This function will be clearer by using 0755 throughout, to avoid
unintentionally implying some connection between the mode of the mount
target, and the mode of the mounted filesystem.
2018-01-17 18:04:34 +00:00
Alan Jenkins 98b1d2b8d9 core: namespace: nitpick /dev/ptmx error handling
If /dev/tty did not exist, or had st_rdev == 0, we ignored it.  And the
same is true for null, zero, full, random, urandom.

If /dev/ptmx did not exist, we treated this as a failure.  If /dev/ptmx had
st_rdev == 0, we ignored it.

This was a very recent change, but there was no reason for ptmx creation
specifically to treat st_rdev == 0 differently from non-existence.  This
confuses me when reading it.

Change the creation of /dev/ptmx so that st_rdev == 0 is
treated as failure.

This still leaves /dev/ptmx as a special case with stricter handling.
However it is consistent with the immediately preceding creation of
/dev/pts/, which is treated as essential, and is directly related to ptmx.

I don't know why we check st_rdev.  But I'd prefer to have only one
unanswered question here, and not to have a second unanswered question
added on top.
2018-01-17 13:28:32 +00:00
Дамјан Георгиевски 414b304ba2 namespace: only make the symlink /dev/ptmx if it was already a symlink
…otherwise try to clone it as a device node

On most contemporary distros /dev/ptmx is a device node, and
/dev/pts/ptmx has 000 inaccessible permissions. In those cases
the symlink /dev/ptmx -> /dev/pts/ptmx breaks the pseudo tty support.

In that case we better clone the device node.

OTOH, in nspawn containers (and possibly others), /dev/pts/ptmx has
normal permissions, and /dev/ptmx is a symlink. In that case make the
same symlink.

fixes #7878
2018-01-17 01:19:46 +01:00
Дамјан Георгиевски b5e99f23ed namespace: extract clone_device_node function from mount_private_dev 2018-01-16 21:41:10 +01:00
Yu Watanabe 03c791aa24 namespace: introduce parse_protect_system_or_bool() 2018-01-02 02:23:13 +09:00
Yu Watanabe 5e1c61544c namespace: introduce parse_protect_home_or_bool() 2018-01-02 02:23:05 +09:00
Lennart Poettering 2d3a5a73e0 nspawn: make sure images containing an ESP are compatible with userns -U mode
In -U mode we might need to re-chown() all files and directories to
match the UID shift we want for the image. That's problematic on fat
partitions, such as the ESP (which is generated by mkosi's
--bootable switch), because fat of course knows no UID/GID file
ownership natively.

With this change we take benefit of the uid= and gid= mount options FAT
knows: instead of chown()ing all files and directories we can just
specify the right UID/GID to use at mount time.

This beefs up the image dissection logic in two ways:

1. First of all support for mounting relevant file systems with
   uid=/gid= is added: when a UID is specified during mount it is used for
   all applicable file systems.

2. Secondly, two new mount flags are added:
   DISSECT_IMAGE_MOUNT_ROOT_ONLY and DISSECT_IMAGE_MOUNT_NON_ROOT_ONLY.
   If one is specified the mount routine will either only mount the root
   partition of an image, or all partitions except the root partition.
   This is used by nspawn: first the root partition is mounted, so that
   we can determine the UID shift in use so far, based on ownership of
   the image's root directory. Then, we mount the remaining partitions
   in a second go, this time with the right UID/GID information.
2017-12-05 13:49:12 +01:00
Shawn Landden 4831981d89 tree-wide: adjust fall through comments so that gcc is happy
Distcc removes comments, making the comment silencing
not work.

I know there was a decision against a macro in commit
ec251fe7d5
2017-11-20 13:06:25 -08:00
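As background, GCC's -Wimplicit-fallthrough warning can be silenced by a comment such as `/* fall through */` placed directly before the next case label, which only works if the compiler actually sees the comment. A minimal sketch with a hypothetical function:

```c
#include <assert.h>

/* Hypothetical example: level 2 intentionally includes everything that
 * level 1 does, so case 2 falls through into case 1. */
int weight_sketch(int level) {
        int w = 0;

        switch (level) {
        case 2:
                w += 2;
                /* fall through */
        case 1:
                w += 1;
                break;
        default:
                break;
        }

        return w;
}
```

If the source is preprocessed on one machine and compiled on another, the comment is gone by the time the warning is evaluated, which is the distcc problem the commit message refers to.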
Zbigniew Jędrzejewski-Szmek 53e1b68390 Add SPDX license identifiers to source files under the LGPL
This follows what the kernel is doing, c.f.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5fd54ace4721fc5ce2bb5aef6318fcf17f421460.
2017-11-19 19:08:15 +01:00
Lennart Poettering 4e0c20de97 namespace: set up OS hierarchy only after mounting the new root, not before
Otherwise it's a pointless exercise, as we'll set up an empty directory
tree that's never going to be used.

Hence, let's move this around a bit, so that we do the basesystem
initialization exactly when RootImage= or RootDirectory= are used, but
not otherwise.
2017-11-13 10:22:36 +01:00
Yu Watanabe d18aff0422 core: ReadWritePaths= and friends assume '+' prefix when BindPaths= or friends are set
When at least one of BindPaths=, BindReadOnlyPaths=, RootImage=,
RuntimeDirectory= or their friends are set, systemd prepares
a namespace under /run/systemd/unit-root. Thus, ReadWritePaths=
or their friends without '+' prefix are completely meaningless.
So, let's assume '+' prefix when one of them is set.

Fixes #7070 and #7080.
2017-11-08 15:48:01 +09:00
Lennart Poettering 0fa5b8312a namespace: make ns_type_supported() a tiny bit shorter
namespace_type_to_string() already validates the type parameter, we can
use that, and shorten the function a bit.
2017-10-10 09:52:08 +02:00
Lennart Poettering bb0ff3fb1b namespace: change NameSpace → Namespace
We generally use the casing "Namespace" for the word, and that's visible
in a number of user-facing interfaces, including "RestrictNamespace=" or
"JoinsNamespaceOf=". Let's make sure to use the same casing internally
too.

As discussed in #7024
2017-10-10 09:51:58 +02:00
Michal Sekletar 6e2d7c4f13 namespace: fall back gracefully when kernel doesn't support network namespaces (#7024) 2017-10-10 09:46:13 +02:00
Zbigniew Jędrzejewski-Szmek 349cc4a507 build-sys: use #if Y instead of #ifdef Y everywhere
The advantage is that if the name is misspelt, cpp will warn us.

$ git grep -Ee "conf.set\('(HAVE|ENABLE)_" -l|xargs sed -r -i "s/conf.set\('(HAVE|ENABLE)_/conf.set10('\1_/"
$ git grep -Ee '#ifn?def (HAVE|ENABLE)' -l|xargs sed -r -i 's/#ifdef (HAVE|ENABLE)/#if \1/; s/#ifndef (HAVE|ENABLE)/#if ! \1/;'
$ git grep -Ee 'if.*defined\(HAVE' -l|xargs sed -i -r 's/defined\((HAVE_[A-Z0-9_]*)\)/\1/g'
$ git grep -Ee 'if.*defined\(ENABLE' -l|xargs sed -i -r 's/defined\((ENABLE_[A-Z0-9_]*)\)/\1/g'
+ manual changes to meson.build

squash! build-sys: use #if Y instead of #ifdef Y everywhere

v2:
- fix incorrect setting of HAVE_LIBIDN2
2017-10-04 12:09:29 +02:00
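The convention can be sketched as follows, assuming the build system always defines each feature macro to either 0 or 1 (which is what the switch to conf.set10 achieves). The macro and function names here are hypothetical.

```c
#include <assert.h>

/* Hypothetical feature macros, always defined, to 0 or 1. */
#define HAVE_EXAMPLE 1
#define ENABLE_EXAMPLE 0

static_assert(HAVE_EXAMPLE == 0 || HAVE_EXAMPLE == 1,
              "feature macros are expected to be 0 or 1");

int example_supported(void) {
#if HAVE_EXAMPLE
        return 1;
#else
        return 0;
#endif
}

int example_enabled(void) {
#if ENABLE_EXAMPLE
        return 1;
#else
        return 0;
#endif
}
```

With `#ifdef`, a misspelt name silently selects the "not defined" branch. With `#if`, a misspelt name is an undefined identifier in a preprocessor expression, which GCC can report when -Wundef is enabled.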
Lennart Poettering 6c47cd7d3b execute: make StateDirectory= and friends compatible with DynamicUser=1 and RootDirectory=/RootImage=
Let's clean up the interaction of StateDirectory= (and friends) to
DynamicUser=1: instead of creating these directories directly below
/var/lib, place them in /var/lib/private instead if DynamicUser=1 is
set, making that directory 0700 and owned by root:root. This way, if a
dynamic UID is later reused, access to the old run's state directory is
prohibited for that user. Then, use file system namespacing inside the
service to make /var/lib/private a readable tmpfs, hiding all state
directories that are not listed in StateDirectory=, and making access to
the actual state directory possible. Mount all directories listed in
StateDirectory= to the same places inside the service (which means
they'll now be mounted into the tmpfs instance). Finally, add a symlink
from the state directory name in /var/lib/ to the one in
/var/lib/private, so that both the host and the service can access the
path under the same location.

Here's an example: let's say a service runs with StateDirectory=foo.
When DynamicUser=0 is set, it will get the following setup, and no
difference between what the unit and what the host sees:

        /var/lib/foo (created as directory)

Now, if DynamicUser=1 is set, we'll instead get this on the host:

        /var/lib/private (created as directory with mode 0700, root:root)
        /var/lib/private/foo (created as directory)
        /var/lib/foo → private/foo (created as symlink)

And from inside the unit:

        /var/lib/private (a tmpfs mount with mode 0755, root:root)
        /var/lib/private/foo (bind mounted from the host)
        /var/lib/foo → private/foo (the same symlink as above)

This takes inspiration from how container trees are protected below
/var/lib/machines: they generally reuse UIDs/GIDs of the host, but
because /var/lib/machines itself is set to 0700 host users cannot access
files in the container tree even if the UIDs/GIDs are reused. However,
for this commit we add one further trick: inside and outside of the unit
/var/lib/private is a different thing: outside it is a plain,
inaccessible directory, and inside it is a world-readable tmpfs mount
with only the whitelisted subdirs below it, bind mounted in.  This
means, from the outside the dir acts as an access barrier, but from the
inside it does not. And the symlink created in /var/lib/foo itself
points across the barrier in both cases, so that root and the unit's
user always have access to these dirs without knowing the details of
this mounting magic.

This logic resolves a major shortcoming of DynamicUser=1 units:
previously they couldn't safely store persistent data. With this change
they can have their own private state, log and data directories, which
they can write to, but which are protected from UID recycling.

With this change, if RootDirectory= or RootImage= are used it is ensured
that the specified state/log/cache directories are always mounted in
from the host. This change of semantics I think is much preferable since
this means the root directory/image logic can be used easily for
read-only resource bundling (as all writable data resides outside of the
image). Note that this is a change of behaviour, but given that we
haven't released any systemd version with StateDirectory= and friends
implemented this should be a safe change to make (in particular as
previously it wasn't clear what would actually happen when used in
combination). Moreover, by making this change we can later add a "+"
modifier to these settings too working similar to the same modifier in
ReadOnlyPaths= and friends, making specified paths relative to the
container itself.
2017-10-02 17:41:44 +02:00
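The host-side layout for DynamicUser=1 described in this commit can be reproduced in a scratch directory with a few system calls. This is only a sketch of the directory and symlink structure: the function names are hypothetical, a temporary directory stands in for /var/lib, and ownership changes, the tmpfs and the bind mounts inside the unit are left out.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: below "base" (standing in for /var/lib), create
 * private/ (0700), private/<name>, and the symlink <name> -> private/<name>.
 * Returns 0 on success or a negative errno value. */
int setup_state_dir_sketch(const char *base, const char *name) {
        char path[PATH_MAX], target[PATH_MAX];
        int n;

        assert(base);
        assert(name);

        /* The access barrier: only its owner may traverse this directory. */
        n = snprintf(path, sizeof(path), "%s/private", base);
        if (n < 0 || (size_t) n >= sizeof(path))
                return -ENAMETOOLONG;
        if (mkdir(path, 0700) < 0 && errno != EEXIST)
                return -errno;

        /* The actual state directory lives behind the barrier. */
        n = snprintf(path, sizeof(path), "%s/private/%s", base, name);
        if (n < 0 || (size_t) n >= sizeof(path))
                return -ENAMETOOLONG;
        if (mkdir(path, 0755) < 0 && errno != EEXIST)
                return -errno;

        /* A relative symlink, so it resolves the same way from the host
         * and from inside the unit. */
        n = snprintf(target, sizeof(target), "private/%s", name);
        if (n < 0 || (size_t) n >= sizeof(target))
                return -ENAMETOOLONG;
        n = snprintf(path, sizeof(path), "%s/%s", base, name);
        if (n < 0 || (size_t) n >= sizeof(path))
                return -ENAMETOOLONG;
        if (symlink(target, path) < 0 && errno != EEXIST)
                return -errno;

        return 0;
}

/* Builds the layout in a fresh temporary directory (left in place for
 * inspection) and returns 0 if the symlink reads back as "private/foo". */
int state_dir_demo(void) {
        char base[] = "/tmp/state-dir-sketch-XXXXXX";
        char link_path[PATH_MAX], buf[PATH_MAX];
        ssize_t k;
        int r;

        if (!mkdtemp(base))
                return -errno;

        r = setup_state_dir_sketch(base, "foo");
        if (r < 0)
                return r;

        snprintf(link_path, sizeof(link_path), "%s/foo", base);
        k = readlink(link_path, buf, sizeof(buf) - 1);
        if (k < 0)
                return -errno;
        buf[k] = '\0';

        return strcmp(buf, "private/foo") == 0 ? 0 : -EINVAL;
}
```

Because the symlink target is relative, the same link works both on the host, where private/ is the 0700 directory, and inside the unit, where private/ is the tmpfs with the bind-mounted state directory.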
Lennart Poettering a227a4be48 namespace: if we can create the destination of bind and PrivateTmp= mounts
When putting together the namespace, always create the file or directory
we are supposed to bind mount on, the same way we do it for most other
stuff, for example mount units or systemd-nspawn's --bind= option.

This has the big benefit that we can use namespace bind mounts on dirs
in /tmp or /var/tmp even in conjunction with PrivateTmp=.
2017-10-02 17:41:43 +02:00
Lennart Poettering e908468b5b namespace: properly handle bind mounts from the host
Before this patch we had an ordering problem: if we have no namespacing
enabled except for two bind mounts that intend to swap /a and /b via
bind mounts, then we'd execute the bind mount binding /b to /a, followed
by the bind mount from /a to /b, thus having the effect that /b is now
visible in both /a and /b, which was not intended.

With this change, as soon as any bind mount is configured we'll put
together the service mount namespace in a temporary directory instead of
operating directly in the root. This solves the problem in a
straightforward fashion: the source of bind mounts will always refer to
the host, and thus be unaffected by the bind mounts we already
created.
2017-10-02 17:41:43 +02:00
Lennart Poettering 645767d6b5 namespace: create /dev, /proc, /sys when needed
We already create /dev implicitly if PrivateTmp=yes is on, if it is
missing. Do so too for the other two API VFS, as well as for /dev if
PrivateTmp=yes is off but MountAPIVFS=yes is on (i.e. when /dev is bind
mounted from the host).
2017-10-02 17:41:43 +02:00
Topi Miettinen 07ce74074d namespace: avoid assertion failure (#6649)
If the root image is not decrypted, it must not be relinquished.
2017-08-29 17:31:24 +02:00
Nicolas Iooss 3a0bf6d6aa namespace: keep selinuxfs mounted read-write with ProtectKernelTunables (#5741)
When a service unit uses "ProtectKernelTunables=yes", it currently
remounts /sys/fs/selinux read-only. This makes libselinux report SELinux
state as "disabled", because most SELinux features are not usable. For
example it is not possible to validate security contexts (with
security_check_context_raw() or /sys/fs/selinux/context). This behavior
of libselinux has been described in
http://danwalsh.livejournal.com/73099.html and confirmed in a recent
email, https://marc.info/?l=selinux&m=149220233032594&w=2 .

Since commit 0c28d51ac8 ("units: further lock down our long-running
services"), systemd-localed unit uses ProtectKernelTunables=yes.
Nevertheless this service needs to use libselinux API in order to create
/etc/vconsole.conf, /etc/locale.conf... with the right SELinux contexts.
This is broken when /sys/fs/selinux is mounted read-only in the mount
namespace of the service.

Make SELinux-aware systemd services work again when they are using
ProtectKernelTunables=yes by keeping selinuxfs mounted read-write.
2017-07-31 17:45:33 +02:00
Timothée Ravier ac9de0b379 core: open /proc/self/mountinfo early to allow mounts over /proc (#5985)
Enable masking the /proc folder using the 'InaccessiblePaths' unit
option.

This also slightly simplifies mount setup, as the bind_remount_recursive
function will only open /proc/self/mountinfo once.

This is based on the suggestion at:
https://lists.freedesktop.org/archives/systemd-devel/2017-April/038634.html
2017-05-19 14:38:40 +02:00
Djalal Harouni 9c988f934b namespace: Apply MountAPIVFS= only when a Root directory is set
The MountAPIVFS= documentation says that this option has no effect
unless used in conjunction with RootDirectory= or RootImage=. Let's fix
this and avoid creating private mount namespaces where they are not
needed.
2017-03-05 21:39:43 +01:00
Djalal Harouni 10404d52e3 namespace: create base-filesystem directories if RootImage= or RootDirectory= are set
When a service is started with its own file system image, always try to
create the base-filesystem directories that are needed. This implicitly
covers the directories handled by MountAPIVFS= {/proc|/sys|/dev}.

Mount protections or MountAPIVFS= mounts were never applied if we
changed the root directory and the related paths were not present under
the new root. The mounts were silently skipped. Fix this by creating those
directories if they are missing.

Closes https://github.com/systemd/systemd/issues/5488
2017-03-05 21:19:29 +01:00
AsciiWolf 13e785f7a0 Fix missing space in comments (#5439) 2017-02-24 18:14:02 +01:00