Commit graph

184 commits

Author SHA1 Message Date
Chris Lamb 3fe910794b Correct a number of trivial typos. 2018-06-18 22:44:44 +02:00
Yu Watanabe 1e8c7bd55c namespace: drop protect_{home,system}_or_bool_from_string()
The functions protect_{home,system}_from_string() are not used
except for defining protect_{home,system}_or_bool_from_string().
This makes protect_{home,system}_from_string() support boolean
strings, and drops protect_{home,system}_or_bool_from_string().
2018-06-15 11:32:27 +02:00
Zbigniew Jędrzejewski-Szmek b0450864f1
Merge pull request #9274 from poettering/comment-header-cleanup
drop "this file is part of systemd" and lennart's copyright from header
2018-06-14 11:26:50 +02:00
Jan Synacek 0722b35934 namespace: always use a root directory when setting up namespace
1) mv /var/tmp /var/tmp.old
2) mkdir /tmp/varrr
3) ln -s /tmp/varrr /var/tmp

Now, when a service has PrivateTmp=yes, during namespace setup,
/tmp is first mounted over with a new mount. Then, when /var/tmp
is being resolved, it points to /tmp/varrr, which by then doesn't
exist, because it had already been obscured.
2018-06-14 10:25:16 +02:00
Lennart Poettering 0c69794138 tree-wide: remove Lennart's copyright lines
These lines are generally out-of-date, incomplete and unnecessary. With
SPDX and git repository much more accurate and fine grained information
about licensing and authorship is available, hence let's drop the
per-file copyright notice. Of course, removing copyright lines of others
is problematic, hence this commit only removes my own lines and leaves
all others untouched. It might be nicer if sooner or later those could
go away too, making git the only and accurate source of authorship
information.
2018-06-14 10:20:20 +02:00
Lennart Poettering 818bf54632 tree-wide: drop 'This file is part of systemd' blurb
This part of the copyright blurb stems from the GPL use recommendations:

https://www.gnu.org/licenses/gpl-howto.en.html

The concept appears to originate in times where version control was per
file, instead of per tree, and was a way to glue the files together.
Ultimately, we nowadays don't live in that world anymore, and this
information is entirely useless anyway, as people are very welcome to
copy these files into any projects they like, and they shouldn't have to
change bits that are part of our copyright header for that.

hence, let's just get rid of this old cruft, and shorten our codebase a
bit.
2018-06-14 10:20:20 +02:00
Zbigniew Jędrzejewski-Szmek 5d904a6aaa tree-wide: drop !! casts to booleans
They are not needed, because anything that is non-zero is converted
to true.

C11:
> 6.3.1.2: When any scalar value is converted to _Bool, the result is 0 if the
> value compares equal to 0; otherwise, the result is 1.

https://stackoverflow.com/questions/31551888/casting-int-to-bool-in-c-c
2018-06-13 10:52:40 +02:00
Lennart Poettering 228af36fff core: add new PrivateMounts= unit setting
This new setting is supposed to be useful in most cases where
"MountFlags=slave" is currently used, i.e. as an explicit way to run a
service in its own mount namespace and decouple propagation from all
mounts of the new mount namespace towards the host.

The effect of MountFlags=slave and PrivateMounts=yes is mostly the same,
as both cause a CLONE_NEWNS namespace to be opened, and both will result
in all mounts within it to be mounted MS_SLAVE. The difference is mostly
on the conceptual/philosophical level: configuring the propagation mode
is nothing people should have to think about, in particular as the
matter is not precisely easyto grok. Moreover, MountFlags= allows configuration
of "private" and "slave" modes which don't really make much sense to use
in real-life and are quite confusing. In particular PrivateMounts=private means
mounts made on the host stay pinned for good by the service which is
particularly nasty for removable media mount. And PrivateMounts=shared
is in most ways a NOP when used a alone...

The main technical difference between setting only MountFlags=slave or
only PrivateMounts=yes in a unit file is that the former remounts all
mounts to MS_SLAVE and leaves them there, while that latter remounts
them to MS_SHARED again right after. The latter is generally a nicer
approach, since it disables propagation, while MS_SHARED is afterwards
in effect, which is really nice as that means further namespacing down
the tree will get MS_SHARED logic by default and we unify how
applications see our mounts as we always pass them as MS_SHARED
regardless whether any mount namespacing is used or not.

The effect of PrivateMounts=yes was implied already by all the other
mount namespacing options. With this new option we add an explicit knob
for it, to request it without any other option used as well.

See: #4393
2018-06-12 16:12:10 +02:00
Yu Watanabe fa65c28176 namespace: rename parse_protect_{home,system}_or_bool() to protect_{home,system}_or_bool_to_string()
Hence, we can define config_parse_protect_{home,system}() by using
DEFINE_CONFIG_PARSE_ENUM() macro.
2018-05-31 11:09:41 +09:00
Lennart Poettering 4e2c0a227e namespace: extend list of masked files by ProtectKernelTunables=
This adds a number of entries nspawn already applies to regular service
namespacing too. Most importantly let's mask /proc/kcore and
/proc/kallsyms too.
2018-05-03 17:46:31 +02:00
Lennart Poettering da6053d0a7 tree-wide: be more careful with the type of array sizes
Previously we were a bit sloppy with the index and size types of arrays,
we'd regularly use unsigned. While I don't think this ever resulted in
real issues I think we should be more careful there and follow a
stricter regime: unless there's a strong reason not to use size_t for
array sizes and indexes, size_t it should be. Any allocations we do
ultimately will use size_t anyway, and converting forth and back between
unsigned and size_t will always be a source of problems.

Note that on 32bit machines "unsigned" and "size_t" are equivalent, and
on 64bit machines our arrays shouldn't grow that large anyway, and if
they do we have a problem, however that kind of overly large allocation
we have protections for usually, but for overflows we do not have that
so much, hence let's add it.

So yeah, it's a story of the current code being already "good enough",
but I think some extra type hygiene is better.

This patch tries to be comprehensive, but it probably isn't and I missed
a few cases. But I guess we can cover that later as we notice it. Among
smaller fixes, this changes:

1. strv_length()' return type becomes size_t

2. the unit file changes array size becomes size_t

3. DNS answer and query array sizes become size_t

Fixes: https://bugs.freedesktop.org/show_bug.cgi?id=76745
2018-04-27 14:29:06 +02:00
Lennart Poettering 088696fe29 namespace: rework how we resolve symlinks in mount points
Before this patch we'd resolve all symlinks of bind mounts and other
mount points to establish for a service in advance, and only then start
mounting them. This is problematic, if symlink chains jump around
between directories in a namespace tree, so that to resolve a specific
symlink chain we need to establish another mount already. A typical case
where this happens is if /etc/resolv.conf is a symlink to some file in
/run: in that case we'd normally resolve and mount /etc/resolv.conf
early on, but that's broken, as to do this properly we'd need to resolve
/etc/resolv.conf first, then figure out that /run needs to be mounted
before we can proceed, and thus reorder the order in which we apply
mounts dynamically.

With this change, whenever we are about to apply a mount, we'll do a
single step of the symlink normalization process, patch the mount entry
accordingly, and then sort the list of mounts to establish again, taking
the new path into account. This means that we can correctly deal with
the example above: we might start with wanting to mount /etc/resolv.conf
early, but after resolving it to the path in /run/ we'd push it to the
end of the list, ensuring that /run is mounted first.

(Note that this also fixes another bug: we were following symlinks on
the bind mount source relative to the root directory of the service,
rather than of the host. That's wrong though as we explicitly document
tha the source of bind mounts is always on the host.)
2018-04-18 14:17:50 +02:00
Lennart Poettering e871786273 namespace: improve logging when creating mount source nodes 2018-04-18 14:15:48 +02:00
Lennart Poettering f8b64b5723 namespace: split out calls to normalize mount entry list into new function 2018-04-18 14:15:48 +02:00
Lennart Poettering c9ef8573be namespace: don't consider raw image read-only if /home in it is writable 2018-04-18 14:15:48 +02:00
Lennart Poettering 12777909c9
Merge pull request #8417 from brauner/2018-03-09/add_bind_mount_fallback_to_private_devices
core: fall back to bind-mounts for PrivateDevices= execution environments
2018-04-18 11:56:56 +02:00
Zbigniew Jędrzejewski-Szmek af984e137e core/namespace: rework the return semantics of clone_device_node yet again
Returning 0 on not-found/wrong-type is confusing. Let's return -ENXIO in that
case instead, and explicitly ignore it in the call site where we want to do that.
I think this is clearer and less likely to be used errenously in case another
call site is added.

C.f. 152c475f95 and 98b1d2b8d9.
2018-04-12 18:15:33 +02:00
Christian Brauner 1649861744 core: fall back to bind-mounts for PrivateDevices= execution environments
In environments where CAP_MKNOD is not available or inside
user namespaces it is still desirable to enable services to use
PrivateDevices= . So fall back to using bind-mounts on EPERM.
2018-04-12 18:15:12 +02:00
Zbigniew Jędrzejewski-Szmek 11a1589223 tree-wide: drop license boilerplate
Files which are installed as-is (any .service and other unit files, .conf
files, .policy files, etc), are left as is. My assumption is that SPDX
identifiers are not yet that well known, so it's better to retain the
extended header to avoid any doubt.

I also kept any copyright lines. We can probably remove them, but it'd nice to
obtain explicit acks from all involved authors before doing that.
2018-04-06 18:58:55 +02:00
Yu Watanabe 1cc6c93a95 tree-wide: use TAKE_PTR() and TAKE_FD() macros 2018-04-05 14:26:26 +09:00
Lennart Poettering 62570f6f03 fs-util: add new CHASE_TRAIL_SLASH flag for chase_symlinks()
This rearranges chase_symlinks() a bit: if no special flags are
specified it will now revert to behaviour before
b12d25a8d6. However, if the new
CHASE_TRAIL_SLASH flag is specified it will follow the behaviour
introduced by that commit.

I wasn't sure which one to make the beaviour that requires specification
of a flag to enable. I opted to make the "append trailing slash"
behaviour the one to enable by a flag, following the thinking that the
function should primarily be used to generate a normalized path, and I
am pretty sure a path without trailing slash is the more "normalized"
one, as the trailing slash is not really a part of it, but merely a
"decorator" that tells various system calls to generate ENOTDIR if the
path doesn't refer to a path.

Or to say this differently: if the slash was part of normalization then
we really should add it in all cases when the final path is a directory,
not just when the user originally specified it.

Fixes: #8544
Replaces: #8545
2018-03-22 19:54:24 +01:00
Zbigniew Jędrzejewski-Szmek 671f0f8de0 Remove /sbin from paths if split-bin is false (#8324)
Follow-up for 157baa87e4.
2018-03-01 21:48:36 +01:00
Ansgar Burchardt 7486f305cd Include additional directories in ProtectSystem 2018-02-27 18:56:19 -03:00
Zbigniew Jędrzejewski-Szmek aa484f3561 tree-wide: use reallocarray instead of our home-grown realloc_multiply (#8279)
There isn't much difference, but in general we prefer to use the standard
functions. glibc provides reallocarray since version 2.26.

I moved explicit_bzero is configure test to the bottom, so that the two stdlib
functions are at the bottom.
2018-02-26 21:20:00 +01:00
Lennart Poettering 13a141f046 namespace: protect bpf file system as part of ProtectKernelTunables=
It also exposes kernel objects, let's better include this in
ProtectKernelTunables=.
2018-02-21 16:43:36 +01:00
Yu Watanabe e4da7d8c79 core: add new option 'tmpfs' to ProtectHome=
This make ProtectHome= setting can take 'tmpfs'. This is mostly
equivalent to `TemporaryFileSystem=/home /run/user /root`.
2018-02-21 09:18:17 +09:00
Yu Watanabe 2abd4e388a core: add new setting TemporaryFileSystem=
This introduces a new setting TemporaryFileSystem=. This is useful
to hide files not relevant to the processes invoked by unit, while
necessary files or directories can be still accessed by combining
with Bind{,ReadOnly}Paths=.
2018-02-21 09:17:52 +09:00
Yu Watanabe 4ca763a902 core/namespace: make '-' prefix in Bind{,ReadOnly}Paths= work
Each path in `Bind{ReadOnly}Paths=` accept '-' prefix. However,
the prefix is completely ignored.
This makes it work as expected.
2018-02-21 09:07:56 +09:00
Yu Watanabe f5c52a7724 core/namespace: remove unused argument 2018-02-21 09:05:30 +09:00
Yu Watanabe e282f51f57 core/namespace: use free_and_replace() 2018-02-21 09:05:21 +09:00
Yu Watanabe 55fe743273 core/namespace: fix comment 2018-02-21 09:05:18 +09:00
Yu Watanabe 89bd586cd3 core/namespace: merge PRIVATE_VAR_TMP into PRIVATE_TMP 2018-02-21 09:05:16 +09:00
Yu Watanabe 2a2969fd5d core/namespace: make arguments const if possible 2018-02-21 09:05:14 +09:00
Zbigniew Jędrzejewski-Szmek f863b1c6fa core: move very long argument to a separate statement
I like compact, but this was a bit too much.
2018-02-15 10:10:01 +01:00
Lennart Poettering 152c475f95 namepace: fix error handling when clone_device_node() returns 0
Before this patch, we'd treat clone_device_node() returning 0 (as
opposed to 1) as error, but then propagate this non-error result in
confusion.

This makes sure that if we ptmx isn't around we propagate that as
-ENXIO.

This is a follow-up for 98b1d2b8d9
2018-01-23 19:50:32 +01:00
Lennart Poettering 36ce7110b0 namespace: use is_symlink() helper
We have this prett ylittle helper, let's use it, it makes things a tiny
bit more readable.
2018-01-23 19:36:55 +01:00
Lennart Poettering 6f7f3a3351 namespace: use stack allocation for paths, where we can 2018-01-23 19:36:36 +01:00
Alan Jenkins 68f7480b7e
Merge pull request #7913 from sourcejedi/devpts
3 nitpicks from core/namespace.c
2018-01-18 21:56:26 +00:00
Alan Jenkins 225874dc9c core: clone_device_node(): add debug message
For people who use debug messages, maybe it is helpful to know that
PrivateDevices= failed due to mknod(), and which device node.

(The other (un-logged) failures could be while mounting filesystems e.g. no
CAP_SYS_ADMIN which is the common case, or missing /dev/shm or /dev/pts,
or missing /dev/ptmx).
2018-01-18 13:58:13 +00:00
Alan Jenkins 8d95368210 core: namespace: remove unnecessary mode on /dev/shm mount target
This should have no behavioural effect; it just confused me.

All the other mount directories in this function are created as 0755.
Some of the mounts are allowed to fail - mqueue and hugepages.
If the /dev/mqueue mount target was created with the permissive mode 01777,
to match the filesystem we're trying to mount there, then a mount failure
would allow unprivileged users to write to the /dev filesystem, e.g. to
exhaust the available space.  There is no reason to allow this.

(Allowing the user read access (0755) seems a reasonable idea though, e.g. for
quicker troubleshooting.)

We do not allow failure of the /dev/shm mount, so it doesn't matter that
it is created as 01777.  But on the same grounds, we have no *reason* to
create it as any specific mode.  0755 is equally fine.

This function will be clearer by using 0755 throughout, to avoid
unintentionally implying some connection between the mode of the mount
target, and the mode of the mounted filesystem.
2018-01-17 18:04:34 +00:00
Alan Jenkins 98b1d2b8d9 core: namespace: nitpick /dev/ptmx error handling
If /dev/tty did not exist, or had st_rdev == 0, we ignored it.  And the
same is true for null, zero, full, random, urandom.

If /dev/ptmx did not exist, we treated this as a failure.  If /dev/ptmx had
st_rdev == 0, we ignored it.

This was a very recent change, but there was no reason for ptmx creation
specifically to treat st_rdev == 0 differently from non-existence.  This
confuses me when reading it.

Change the creation of /dev/ptmx so that st_rdev == 0 is
treated as failure.

This still leaves /dev/ptmx as a special case with stricter handling.
However it is consistent with the immediately preceding creation of
/dev/pts/, which is treated as essential, and is directly related to ptmx.

I don't know why we check st_rdev.  But I'd prefer to have only one
unanswered question here, and not to have a second unanswered question
added on top.
2018-01-17 13:28:32 +00:00
Дамјан Георгиевски 414b304ba2 namespace: only make the symlink /dev/ptmx if it was already a symlink
…otherwise try to clone it as a device node

On most contemporary distros /dev/ptmx is a device node, and
/dev/pts/ptmx has 000 inaccessible permissions. In those cases
the symlink /dev/ptmx -> /dev/pts/ptmx breaks the pseudo tty support.

In that case we better clone the device node.

OTOH, in nspawn containers (and possibly others), /dev/pts/ptmx has
normal permissions, and /dev/ptmx is a symlink. In that case make the
same symlink.

fixes #7878
2018-01-17 01:19:46 +01:00
Дамјан Георгиевски b5e99f23ed namespace: extract clone_device_node function from mount_private_dev 2018-01-16 21:41:10 +01:00
Yu Watanabe 03c791aa24 namespace: introduce parse_protect_system()_or_bool 2018-01-02 02:23:13 +09:00
Yu Watanabe 5e1c61544c namespace: introduce parse_protect_home_or_bool() 2018-01-02 02:23:05 +09:00
Lennart Poettering 2d3a5a73e0 nspawn: make sure images containing an ESP are compatible with userns -U mode
In -U mode we might need to re-chown() all files and directories to
match the UID shift we want for the image. That's problematic on fat
partitions, such as the ESP (and which is generated by mkosi's
--bootable switch), because fat of course knows no UID/GID file
ownership natively.

With this change we take benefit of the uid= and gid= mount options FAT
knows: instead of chown()ing all files and directories we can just
specify the right UID/GID to use at mount time.

This beefs up the image dissection logic in two ways:

1. First of all support for mounting relevant file systems with
   uid=/gid= is added: when a UID is specified during mount it is used for
   all applicable file systems.

2. Secondly, two new mount flags are added:
   DISSECT_IMAGE_MOUNT_ROOT_ONLY and DISSECT_IMAGE_MOUNT_NON_ROOT_ONLY.
   If one is specified the mount routine will either only mount the root
   partition of an image, or all partitions except the root partition.
   This is used by nspawn: first the root partition is mounted, so that
   we can determine the UID shift in use so far, based on ownership of
   the image's root directory. Then, we mount the remaining partitions
   in a second go, this time with the right UID/GID information.
2017-12-05 13:49:12 +01:00
Shawn Landden 4831981d89 tree-wide: adjust fall through comments so that gcc is happy
Distcc removes comments, making the comment silencing
not work.

I know there was a decision against a macro in commit
ec251fe7d5
2017-11-20 13:06:25 -08:00
Zbigniew Jędrzejewski-Szmek 53e1b68390 Add SPDX license identifiers to source files under the LGPL
This follows what the kernel is doing, c.f.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5fd54ace4721fc5ce2bb5aef6318fcf17f421460.
2017-11-19 19:08:15 +01:00
Lennart Poettering 4e0c20de97 namespace: set up OS hierarchy only after mounting the new root, not before
Otherwise it's a pointless excercise, as we'll set up an empty directory
tree that's never going to be used.

Hence, let's move this around a bit, so that we do the basesystem
initialization exactly when RootImage= or RootDirectory= are used, but
not otherwise.
2017-11-13 10:22:36 +01:00
Yu Watanabe d18aff0422 core: ReadWritePaths= and friends assume '+' prefix when BindPaths= or freinds are set
When at least one of BindPaths=, BindReadOnlyPaths=, RootImage=,
RuntimeDirectory= or their friends are set, systemd prepares
a namespace under /run/systemd/unit-root. Thus, ReadWritePaths=
or their friends without '+' prefix is completely meaningless.
So, let's assume '+' prefix when one of them are set.

Fixes #7070 and #7080.
2017-11-08 15:48:01 +09:00