Commit graph

523 commits

Author SHA1 Message Date
Lennart Poettering e28c7cd066 tree-wide: set SA_RESTART for signal handlers we install
We already set it in most cases, but make sure to set it in all others too, and
document that that's a good idea.
2016-12-01 12:41:17 +01:00
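
To illustrate the SA_RESTART commit above: a minimal sketch of installing a handler with that flag via sigaction(), so that certain interrupted system calls are restarted automatically instead of failing with EINTR. The helper names install_restarting_handler(), handler_restarts() and on_signal() are hypothetical and do not come from the systemd tree.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <signal.h>
#include <string.h>

/* Hypothetical no-op handler, used only for illustration. */
static void on_signal(int signo) {
        (void) signo;
}

/* Install `handler` for `signo` with SA_RESTART set.
 * Returns 0 on success, -1 on failure (errno is set by sigaction()). */
static int install_restarting_handler(int signo, void (*handler)(int)) {
        struct sigaction sa;

        assert(handler);

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = handler;
        sa.sa_flags = SA_RESTART;   /* restart interruptible syscalls after the handler returns */
        sigemptyset(&sa.sa_mask);

        return sigaction(signo, &sa, NULL);
}

/* Query the currently installed disposition for `signo`.
 * Returns 1 if SA_RESTART is set, 0 if not, -1 on failure. */
static int handler_restarts(int signo) {
        struct sigaction old;

        if (sigaction(signo, NULL, &old) < 0)
                return -1;

        return (old.sa_flags & SA_RESTART) != 0;
}
```

A caller would typically run install_restarting_handler(SIGUSR1, on_signal) once during start-up; handler_restarts() is only there to make the effect observable.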
Lennart Poettering ad85779a50 nspawn: split out overlayfs argument parsing into a function of its own
Add overlay_mount_parse() similar in style to tmpfs_mount_parse() and
bind_mount_parse().
2016-12-01 00:25:51 +01:00
Lennart Poettering 8d4aa2bb32 nspawn: make use of CHASE_NON_EXISTING when locking image
If --template= is used on an image, then the image might not exist initially.
We can use CHASE_NON_EXISTING to properly lock the image already before it
exists. Let's do so.
2016-12-01 00:25:51 +01:00
Lennart Poettering c4f4fce79e fs-util: add flags parameter to chase_symlinks()
Let's remove chase_symlinks_prefix() and instead introduce a flags parameter to
chase_symlinks(), with a flag CHASE_PREFIX_ROOT that exposes the behaviour of
chase_symlinks_prefix().
2016-12-01 00:25:51 +01:00
Lennart Poettering 8cd328d82e nspawn: accept --ephemeral --template= as alternative for --ephemeral --directory=
As suggested in PR #3667.

This PR simply ensures that --template= can be used as alternative to
--directory= when --ephemeral is used, following the logic that for ephemeral
options the source directory is actually a template.

This does not deprecate usage of --directory= with --ephemeral, as I am not
convinced the old logic wouldn't make sense.

Fixes: #3667
2016-12-01 00:25:51 +01:00
Lennart Poettering 3f342ec4b0 nspawn: properly handle image/directory paths that are symlinks
This resolves any paths specified on --directory=, --template=, and --image=
before using them. This makes sure nspawn can be used correctly on symlinked
images and directory trees.

Fixes: #2001
2016-12-01 00:25:51 +01:00
Lennart Poettering e187369587 tree-wide: stop using canonicalize_file_name(), use chase_symlinks() instead
Let's use chase_symlinks() everywhere, and stop using GNU
canonicalize_file_name() everywhere. For most cases this should not change
behaviour, but it increases exposure of our function so that it gets tested better. Most
importantly in a few cases (most notably nspawn) it can take the correct root
directory into account when chasing symlinks.
2016-12-01 00:25:51 +01:00
Lennart Poettering 17cbb288fa nspawn: add fallback to normal copy/reflink when we cannot btrfs snapshot
Given that other file systems (notably: xfs) support reflinks these days, let's
extend the file system snapshotting logic to fall back to plain copies or
reflinks when full btrfs subvolume snapshots are not available.

This essentially makes "systemd-nspawn --ephemeral" and "systemd-nspawn
--template=" available on non-btrfs subvolumes. Of course, both operations will
still be slower on non-btrfs than on btrfs (simply because reflinking each file
individually in a directory tree is still slower than doing this in one step
for a whole subvolume), but it's probably good enough for many cases, and we
should provide users with the tools; they have to figure out what's good
for them.

Note that "machinectl clone" already had a fallback like this in place; this
patch generalizes it and adds similar support to our other cases.
2016-11-22 13:35:09 +01:00
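
The fallback described in the commit above can be sketched for a single file: try a reflink first and fall back to a plain byte copy when the file system refuses. This is only an illustration under stated assumptions. The function names reflink_or_copy() and copy_bytes() are hypothetical, the sketch handles one regular file rather than a directory tree, and it does not attempt the btrfs subvolume snapshot step at all.

```c
#define _GNU_SOURCE

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <unistd.h>

#ifndef FICLONE
#define FICLONE _IOW(0x94, 9, int)   /* for older kernel headers */
#endif

/* Plain byte-for-byte copy from src_fd to dst_fd. Returns 0 or -errno. */
static int copy_bytes(int src_fd, int dst_fd) {
        char buf[64 * 1024];

        for (;;) {
                ssize_t n = read(src_fd, buf, sizeof(buf));
                if (n < 0) {
                        if (errno == EINTR)
                                continue;
                        return -errno;
                }
                if (n == 0)
                        return 0;   /* end of file */

                for (ssize_t off = 0; off < n; ) {
                        ssize_t w = write(dst_fd, buf + off, (size_t) (n - off));
                        if (w < 0) {
                                if (errno == EINTR)
                                        continue;
                                return -errno;
                        }
                        off += w;
                }
        }
}

/* Returns 1 if dst was reflinked, 0 if it was copied, -errno on failure. */
static int reflink_or_copy(const char *src, const char *dst) {
        int src_fd, dst_fd, r;

        assert(src);
        assert(dst);

        src_fd = open(src, O_RDONLY | O_CLOEXEC);
        if (src_fd < 0)
                return -errno;

        dst_fd = open(dst, O_WRONLY | O_CREAT | O_TRUNC | O_CLOEXEC, 0644);
        if (dst_fd < 0) {
                r = -errno;
                close(src_fd);
                return r;
        }

        if (ioctl(dst_fd, FICLONE, src_fd) == 0)
                r = 1;                            /* file system shared the extents */
        else
                r = copy_bytes(src_fd, dst_fd);   /* any reflink failure: plain copy */

        if (close(dst_fd) < 0 && r >= 0)
                r = -errno;
        close(src_fd);
        return r;
}
```

Treating every FICLONE failure as "not supported" keeps the sketch short; a more careful version would only fall back on errors such as EOPNOTSUPP, EXDEV or ENOTTY and report the rest.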
Lennart Poettering c67b008273 nspawn: remove temporary root directory on exit
When mounting a loopback image, we need a temporary root directory we can mount
stuff to. Make sure to actually remove it when exiting, so that we don't leave
stuff around in /tmp unnecessarily.

See: #4664
2016-11-22 13:35:09 +01:00
Lennart Poettering 6a0f896b97 nspawn: try to wait for the container PID 1 to exit, before we exit
Let's make the shutdown logic synchronous, so that there's a better chance to
detach the loopback device after use.
2016-11-22 13:35:09 +01:00
Lennart Poettering 0f3be6ca4d nspawn: support ephemeral boots from images
Previously --ephemeral was only supported with container trees in btrfs
subvolumes (i.e. in combination with --directory=). This adds support for
--ephemeral in conjunction with disk images (i.e. --image=) too.

As a side effect, this fixes the problem that --ephemeral was accepted but ignored
when using -M on a container that turned out to be an image.

Fixes: #4664
2016-11-22 13:35:09 +01:00
Lennart Poettering f4ff4aa800 Merge pull request #4395 from s-urbaniak/rw-support
nspawn: R/W support for /sysfs, /proc, and /proc/sys/net
2016-11-18 12:36:46 +01:00
Sergiusz Urbaniak 4f086aab52 nspawn: R/W support for /sys, and /proc/sys
This commit adds the possibility to leave /sys, and /proc/sys read-write.
It introduces a new (undocumented) env var SYSTEMD_NSPAWN_API_VFS_WRITABLE
to enable this feature.

If set to "yes", /sys, and /proc/sys will be read-write.
If set to "no", /sys, and /proc/sys will be read-only.
If set to "network" /proc/sys/net will be read-write. This is useful in
use-cases, where systemd-nspawn is used in an external network
namespace.

This adds the possibility to start privileged containers which need more
control over settings in the /proc, and /sys filesystem.

This is also a follow-up on the discussion from
https://github.com/systemd/systemd/pull/4018#r76971862 where an
introduction of a simple env var to enable R/W support for those
directories was already discussed.
2016-11-18 09:50:40 +01:00
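
The three values of SYSTEMD_NSPAWN_API_VFS_WRITABLE listed in the commit above map naturally onto a small enum. The following is a sketch only: the type ApiVfsMode, its members and parse_api_vfs_writable() are hypothetical names, and treating an unset variable as read-only is an assumption based on the commit describing read-write as the new, opt-in behaviour.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical names; the enum actually used by nspawn is not shown in the log. */
typedef enum {
        API_VFS_READ_ONLY,   /* "no": /sys and /proc/sys are read-only */
        API_VFS_WRITABLE,    /* "yes": /sys and /proc/sys are read-write */
        API_VFS_NETWORK,     /* "network": /proc/sys/net is read-write */
} ApiVfsMode;

/* Translate the environment variable value into a mode.
 * `value` may be NULL (variable unset). Returns 0 on success, -EINVAL otherwise. */
static int parse_api_vfs_writable(const char *value, ApiVfsMode *ret) {
        assert(ret);

        if (!value || strcmp(value, "no") == 0)
                *ret = API_VFS_READ_ONLY;   /* assumption: unset behaves like "no" */
        else if (strcmp(value, "yes") == 0)
                *ret = API_VFS_WRITABLE;
        else if (strcmp(value, "network") == 0)
                *ret = API_VFS_NETWORK;
        else
                return -EINVAL;

        return 0;
}
```

A caller would pass getenv("SYSTEMD_NSPAWN_API_VFS_WRITABLE") as `value` and choose the mount flags for /sys, /proc/sys and /proc/sys/net from the resulting mode.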
Zbigniew Jędrzejewski-Szmek 2a49b6120f nspawn: restart the whole systemd-nspawn@.service unit on container reboot (#4613)
Since 133 is now used in a few places, add a #define for it.
Also make the status message a bit more informative.

Another issue introduced in b006762. The logic was borked, we were supposed
to return 0 to break the loop, and 133 to restart the container, not the other
way around.

But this doesn't seem to work, reboot fails with:
Nov 08 00:41:32 laptop systemd-nspawn[26564]: Failed to register machine: Machine 'fedora-rawhide' already exists
So actually the version before this patch worked better, since 133 > 0 and we'd
at least loop internally.
2016-11-14 11:49:49 +01:00
Christian Hesse 7debb05dbe nspawn: fix condition for mounting resolv.conf (#4622)
The file /usr/lib/systemd/resolv.conf can be stale; it does not tell us
whether systemd-resolved is running.
So check for /run/systemd/resolve/resolv.conf as well, which is created
at runtime and hence is a better indication.
2016-11-08 22:01:26 -05:00
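
The condition described in the commit above amounts to requiring that both files exist. A minimal sketch, assuming plain existence checks are sufficient; resolved_probably_running() is a hypothetical name, and the two paths are passed in as parameters so the check can be exercised on any system.

```c
#include <assert.h>
#include <stdbool.h>
#include <unistd.h>

/* True only if both the static file and the runtime-generated file exist.
 * The runtime file is the stronger hint, since it is created while the
 * service is running. */
static bool resolved_probably_running(const char *static_conf, const char *runtime_conf) {
        assert(static_conf);
        assert(runtime_conf);

        return access(static_conf, F_OK) == 0 &&
               access(runtime_conf, F_OK) == 0;
}
```

With the paths from the commit message, the call would be resolved_probably_running("/usr/lib/systemd/resolv.conf", "/run/systemd/resolve/resolv.conf").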
Zbigniew Jędrzejewski-Szmek a809cee582 Merge pull request #4612 from keszybz/format-strings
Format string tweaks (and a small fix on 32bit)
2016-11-08 08:09:40 -05:00
Martin Pitt cfed63f60d nspawn: fix exit code for --help and --version (#4609)
Commit b006762 inverted the initial exit code which is relevant for --help and
--version without a particular reason.  For these special options, parse_argv()
returns 0 so that our main() immediately skips to the end without adjusting
"ret". Otherwise, if an actual container is being started, ret is set on error
in run(), which still provides the "non-zero exit on error" behaviour.

Fixes #4605.
2016-11-07 23:31:55 -05:00
Zbigniew Jędrzejewski-Szmek f97b34a629 Rename formats-util.h to format-util.h
We don't have plural in the name of any other -util files and this
inconsistency trips me up every time I try to type this file name
from memory. "formats-util" is even hard to pronounce.
2016-11-07 10:15:08 -05:00
Lennart Poettering 493fd52f1a Merge pull request #4510 from keszybz/tree-wide-cleanups
Tree wide cleanups
2016-11-03 13:59:20 -06:00
Lennart Poettering 2bce2acce8 nspawn: if we set up a loopback device, try to mount it with "discard"
Let's make sure that our loopback files remain sparse, hence let's set
"discard" as mount option on file systems that support it if the backing device
is a loopback.
2016-11-02 11:39:49 -06:00
Evgeny Vereshchagin 6d66bd3b2a nspawn: become a new root early
036d523641

> vfs: Don't create inodes with a uid or gid unknown to the vfs
  It is expected that filesystems can not represent uids and gids from
  outside of their user namespace.  Keep things simple by not even
  trying to create filesystem nodes with non-sense uids and gids.

So, we actually should `reset_uid_gid` early to prevent https://github.com/systemd/systemd/pull/4223#issuecomment-252522955

$ sudo UNIFIED_CGROUP_HIERARCHY=no LD_LIBRARY_PATH=.libs .libs/systemd-nspawn -D /var/lib/machines/fedora-rawhide -U -b systemd.unit=multi-user.target

Spawning container fedora-rawhide on /var/lib/machines/fedora-rawhide.
Press ^] three times within 1s to kill container.
Child died too early.
Selected user namespace base 1073283072 and range 65536.
Failed to mount to /sys/fs/cgroup/systemd: No such file or directory

Details: https://github.com/systemd/systemd/pull/4223#issuecomment-253046519
Fixes: #4352
2016-10-23 23:23:42 -04:00
Zbigniew Jędrzejewski-Szmek 605405c6cc tree-wide: drop NULL sentinel from strjoin
This makes strjoin and strjoina more similar and avoids the useless final
argument.

spatch -I . -I ./src -I ./src/basic -I ./src/basic -I ./src/shared -I ./src/shared -I ./src/network -I ./src/locale -I ./src/login -I ./src/journal -I ./src/journal -I ./src/timedate -I ./src/timesync -I ./src/nspawn -I ./src/resolve -I ./src/resolve -I ./src/systemd -I ./src/core -I ./src/core -I ./src/libudev -I ./src/udev -I ./src/udev/net -I ./src/udev -I ./src/libsystemd/sd-bus -I ./src/libsystemd/sd-event -I ./src/libsystemd/sd-login -I ./src/libsystemd/sd-netlink -I ./src/libsystemd/sd-network -I ./src/libsystemd/sd-hwdb -I ./src/libsystemd/sd-device -I ./src/libsystemd/sd-id128 -I ./src/libsystemd-network --sp-file coccinelle/strjoin.cocci --in-place $(git ls-files src/*.c)

git grep -e '\bstrjoin\b.*NULL' -l|xargs sed -i -r 's/strjoin\((.*), NULL\)/strjoin(\1)/'

This might have missed a few cases (spatch has a really hard time dealing
with _cleanup_ macros), but that's no big issue, they can always be fixed
later.
2016-10-23 11:43:27 -04:00
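
One way to let callers drop a trailing NULL sentinel, as in the strjoin change above, is a variadic macro that appends the sentinel itself. The sketch below shows that general technique with hypothetical names (join_strings(), join_strings_real()); it is not claimed to be the definition used in the systemd tree.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical worker: concatenates all arguments up to a NULL sentinel.
 * Returns a newly allocated string, or NULL on allocation failure. */
static char *join_strings_real(const char *first, ...) {
        va_list ap;
        size_t len = 0;
        char *result, *p;

        /* First pass: compute the total length. */
        va_start(ap, first);
        for (const char *s = first; s; s = va_arg(ap, const char *))
                len += strlen(s);
        va_end(ap);

        result = malloc(len + 1);
        if (!result)
                return NULL;

        /* Second pass: copy the pieces one after another. */
        p = result;
        va_start(ap, first);
        for (const char *s = first; s; s = va_arg(ap, const char *)) {
                size_t n = strlen(s);
                memcpy(p, s, n);
                p += n;
        }
        va_end(ap);
        *p = '\0';

        return result;
}

/* Callers write join_strings("a", "b"); the macro supplies the sentinel. */
#define join_strings(...) join_strings_real(__VA_ARGS__, (const char *) NULL)
```

The explicit cast on the sentinel matters because a bare NULL passed through "..." is not guaranteed to have pointer type on every platform.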
Zbigniew Jędrzejewski-Szmek 24597ee0e6 nspawn, NEWS: add missing "s" in --private-users-chown (#4438) 2016-10-21 06:03:26 +03:00
Evgeny Vereshchagin f0bef277a4 nspawn: cleanup and chown the synced cgroup hierarchy (#4223)
Fixes: #4181
2016-10-13 09:50:46 -04:00
Zbigniew Jędrzejewski-Szmek 60e76d4897 nspawn,mount-util: add [u]mount_verbose and use it in nspawn
This makes it easier to debug failed nspawn invocations:

Mounting sysfs on /var/lib/machines/fedora-rawhide/sys (MS_RDONLY|MS_NOSUID|MS_NOEXEC|MS_NODEV "")...
Mounting tmpfs on /var/lib/machines/fedora-rawhide/dev (MS_NOSUID|MS_STRICTATIME "mode=755,uid=1450901504,gid=1450901504")...
Mounting tmpfs on /var/lib/machines/fedora-rawhide/dev/shm (MS_NOSUID|MS_NODEV|MS_STRICTATIME "mode=1777,uid=1450901504,gid=1450901504")...
Mounting tmpfs on /var/lib/machines/fedora-rawhide/run (MS_NOSUID|MS_NODEV|MS_STRICTATIME "mode=755,uid=1450901504,gid=1450901504")...
Bind-mounting /sys/fs/selinux on /var/lib/machines/fedora-rawhide/sys/fs/selinux (MS_BIND "")...
Remounting /var/lib/machines/fedora-rawhide/sys/fs/selinux (MS_RDONLY|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_BIND|MS_REMOUNT "")...
Mounting proc on /proc (MS_NOSUID|MS_NOEXEC|MS_NODEV "")...
Bind-mounting /proc/sys on /proc/sys (MS_BIND "")...
Remounting /proc/sys (MS_RDONLY|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_BIND|MS_REMOUNT "")...
Bind-mounting /proc/sysrq-trigger on /proc/sysrq-trigger (MS_BIND "")...
Remounting /proc/sysrq-trigger (MS_RDONLY|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_BIND|MS_REMOUNT "")...
Mounting tmpfs on /tmp (MS_STRICTATIME "mode=1777,uid=0,gid=0")...
Mounting tmpfs on /sys/fs/cgroup (MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_STRICTATIME "mode=755,uid=0,gid=0")...
Mounting cgroup on /sys/fs/cgroup/systemd (MS_NOSUID|MS_NOEXEC|MS_NODEV "none,name=systemd,xattr")...
Failed to mount cgroup on /sys/fs/cgroup/systemd (MS_NOSUID|MS_NOEXEC|MS_NODEV "none,name=systemd,xattr"): No such file or directory
2016-10-11 16:50:07 -04:00
Zbigniew Jędrzejewski-Szmek ada5412039 nspawn: simplify arg_us_cgns passing
We would check the condition cg_ns_supported() twice. No functional
change.
2016-10-11 16:46:58 -04:00
Lennart Poettering 6dca2fe325 Merge pull request #4332 from keszybz/nspawn-arguments-3
nspawn --private-users parsing, v2
2016-10-10 19:51:51 +02:00
Evgeny Vereshchagin a0f72a24e0 Merge pull request #4310 from keszybz/nspawn-autodetect
Autodetect systemd version in containers started by systemd-nspawn
2016-10-10 20:47:25 +03:00
Zbigniew Jędrzejewski-Szmek be7157316c nspawn: better error messages for parsing errors
In particular, the check for arg_uid_range <= 0 is moved to the end, so that
"foobar:0" gives "Failed to parse UID", and not "UID range cannot be 0.".
2016-10-10 11:55:06 -04:00
Zbigniew Jędrzejewski-Szmek ae209204d8 nspawn,man: fix parsing of numeric args for --private-users, accept any boolean
This is like the previous reverted commit, but any boolean is still accepted,
not just "yes" and "no". Man page is adjusted to match the code.
2016-10-10 11:55:06 -04:00
Zbigniew Jędrzejewski-Szmek 6c2058b35e Revert "nspawn: fix parsing of numeric arguments for --private-users"
This reverts commit bfd292ec35.
2016-10-10 11:17:40 -04:00
Zbigniew Jędrzejewski-Szmek bfd292ec35 nspawn: fix parsing of numeric arguments for --private-users
The documentation lists "yes", "no", "pick", and numeric arguments.
But parse_boolean was attempted first, so various numeric arguments were
misinterpreted.

In particular, this fixes --private-users=0 to mean the same thing as
--private-users=0:65536.

While at it, use strndupa to avoid some error handling.
Also give a better error for an empty UID range. I think it's likely that
people will use --private-users=0:0 thinking that the argument means UID:GID.
2016-10-09 11:52:35 -04:00
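
The ordering problem described above (numeric arguments being swallowed by boolean parsing) can be avoided by trying the numeric form first. The following sketch assumes the argument syntax UID[:RANGE] and the default range of 65536 mentioned in the commit message; every type and function name in it is hypothetical, and the error codes are chosen for illustration only.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

typedef enum { USERNS_OFF, USERNS_ON, USERNS_PICK, USERNS_FIXED } UsernsMode;

typedef struct {
        UsernsMode mode;
        uint32_t uid_shift;   /* only meaningful for USERNS_FIXED */
        uint32_t uid_range;   /* only meaningful for USERNS_FIXED */
} UsernsSetting;

/* Parse exactly `n` decimal digits starting at `s` into a 32-bit value. */
static int parse_u32(const char *s, size_t n, uint32_t *ret) {
        uint64_t v = 0;

        if (n == 0)
                return -EINVAL;

        for (size_t i = 0; i < n; i++) {
                if (!isdigit((unsigned char) s[i]))
                        return -EINVAL;
                v = v * 10 + (uint64_t) (s[i] - '0');
                if (v > UINT32_MAX)
                        return -ERANGE;
        }

        *ret = (uint32_t) v;
        return 0;
}

static int parse_private_users(const char *arg, UsernsSetting *ret) {
        assert(arg);
        assert(ret);

        /* Numeric forms first: "0" must mean UID 0, not boolean false. */
        if (isdigit((unsigned char) arg[0])) {
                const char *colon = strchr(arg, ':');
                uint32_t shift, range = 65536;
                int r;

                r = parse_u32(arg, colon ? (size_t) (colon - arg) : strlen(arg), &shift);
                if (r < 0)
                        return r;

                if (colon) {
                        r = parse_u32(colon + 1, strlen(colon + 1), &range);
                        if (r < 0)
                                return r;
                }

                if (range == 0)   /* an empty UID range is rejected */
                        return -EINVAL;

                *ret = (UsernsSetting) { USERNS_FIXED, shift, range };
                return 0;
        }

        if (strcmp(arg, "pick") == 0)
                *ret = (UsernsSetting) { USERNS_PICK, 0, 0 };
        else if (strcmp(arg, "yes") == 0)
                *ret = (UsernsSetting) { USERNS_ON, 0, 0 };
        else if (strcmp(arg, "no") == 0)
                *ret = (UsernsSetting) { USERNS_OFF, 0, 0 };
        else
                return -EINVAL;

        return 0;
}
```

With this ordering, parse_private_users("0", &s) yields a fixed mapping with shift 0 and range 65536, while "0:0" is rejected as an empty range.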
Zbigniew Jędrzejewski-Szmek 27eb8e9028 nspawn: reindent table 2016-10-09 11:51:18 -04:00
Zbigniew Jędrzejewski-Szmek a8725a06e6 nspawn: also fall back to legacy cgroup hierarchy for old containers
The current systemd version detection routine cannot detect systemd 230,
only systemd >= 231. This means that we'll still use the legacy hierarchy
in some cases where we wouldn't have to. If somebody figures out a nice
way to detect systemd 230 this can be later improved.
2016-10-08 19:03:53 -04:00
Zbigniew Jędrzejewski-Szmek 0fd9563fde nspawn: use mixed cgroup hierarchy only when container has new systemd
systemd-soon-to-be-released-232 is able to deal with the mixed hierarchy.
So make an educated guess, and use the mixed hierarchy in that case.

Tested by running the host with mixed hierarchy (i.e. simply using a recent
kernel with systemd from git), and booting first a container with older systemd,
and then one with a newer systemd.

Fixes #4008.
2016-10-08 14:48:41 -04:00
Zbigniew Jędrzejewski-Szmek 27e29a1e43 nspawn: fix spurious reboot if container process returns 133 2016-10-08 14:48:41 -04:00
Zbigniew Jędrzejewski-Szmek b006762524 nspawn: move the main loop body out to a new function
The new function has 416 lines by itself!

"return log_error_errno" is used to nicely reduce the volume of error
handling code.

A few minor issues are fixed on the way:
- positive value was used as error value (EIO), causing systemd-nspawn
  to return success, even though it shouldn't.
- In two places random values were used as error status, when the
  actual value was in an unusual place (etc_password_lock, notify_socket).

Those are the only functional changes.

There is another potential issue, which is marked with a comment, and left
unresolved: the container can also return 133 by itself, causing a spurious
reboot.
2016-10-08 14:48:41 -04:00
Zbigniew Jędrzejewski-Szmek 98afd6af3a nspawn: check env var first, detect second
If we are going to use the env var to override the detection result
anyway, there is no point in doing the detection, especially since
it can fail.
2016-10-08 14:48:41 -04:00
Lennart Poettering 7429b2eb83 tree-wide: drop some misleading compiler warnings
gcc at some optimization levels thinks these variables were used without
initialization. It's wrong, but let's make the message go away anyway.
2016-10-06 19:04:10 +02:00
Djalal Harouni 41eb436265 nspawn: add log message to let users know that nspawn needs an empty /dev directory (#4226)
Fixes https://github.com/systemd/systemd/issues/3695

At the same time it adds a protection against userns chown of inodes of
a shared mount point.
2016-10-05 06:57:02 +02:00
Alban Crequy 19caffac75 nspawn: set shared propagation mode for the container 2016-10-03 14:19:27 +02:00
Evgeny Vereshchagin cc238590e4 Merge pull request #4185 from endocode/djalal-sandbox-first-protection-v1
core:sandbox: Add new ProtectKernelTunables=, ProtectControlGroups=, ProtectSystem=strict and fixes
2016-09-28 04:50:30 +03:00
Torstein Husebø d23a0044a3 treewide: fix typos (#4217) 2016-09-26 11:32:47 +02:00
Lennart Poettering 6b7c9f8bce namespace: rework how ReadWritePaths= is applied
Previously, if ReadWritePaths= was nested inside a ReadOnlyPaths=
specification, then we'd first recursively apply the ReadOnlyPaths= paths, and
make everything below read-only, only in order to then flip the read-only bit
again for the subdirs listed in ReadWritePaths= below it.

This is not only ugly (as for the dirs in question we first turn on the RO bit,
only to turn it off again immediately after), but also problematic in
containers, where a container manager might have marked a set of dirs read-only
and this code will undo this if ReadWritePaths= is set for any.

With this patch behaviour in this regard is altered: ReadOnlyPaths= will not be
applied to the children listed in ReadWritePaths= in the first place, so that
we do not need to turn off the RO bit for those after all.

This means that ReadWritePaths=/ReadOnlyPaths= may only be used to turn on the
RO bit, but never to turn it off again. Or to say this differently: if some
dirs are marked read-only via some external tool, then ReadWritePaths= will not
undo it.

This is not only the safer option, but also more in-line with what the man page
currently claims:

        "Entries (files or directories) listed in ReadWritePaths= are
        accessible from within the namespace with the same access rights as
        from outside."

To implement this change bind_remount_recursive() gained a new "blacklist"
string list parameter, which when passed may contain subdirs that shall be
excluded from the read-only mounting.

A number of functions are updated to add more debug logging to make this more
digestible.
2016-09-25 10:40:51 +02:00
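
The "blacklist" idea from the commit above boils down to skipping any path that equals a blacklisted entry or lies below one. A minimal sketch of that check follows, assuming normalized absolute paths without trailing slashes; both function names are hypothetical and the real bind_remount_recursive() does considerably more than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* True if `path` equals `prefix` or lies below it, matching whole path
 * components only ("/var/lib/foobar" is not below "/var/lib/foo"). */
static bool path_is_at_or_below(const char *path, const char *prefix) {
        size_t n;

        assert(path);
        assert(prefix);

        n = strlen(prefix);
        if (strncmp(path, prefix, n) != 0)
                return false;
        if (n > 0 && prefix[n - 1] == '/')
                return true;   /* prefix "/" covers every absolute path */

        return path[n] == '\0' || path[n] == '/';
}

/* `blacklist` is a NULL-terminated array of paths, or NULL for "none".
 * Returns false for paths that must be left alone. */
static bool should_remount_read_only(const char *path, const char *const *blacklist) {
        assert(path);

        if (blacklist)
                for (const char *const *b = blacklist; *b; b++)
                        if (path_is_at_or_below(path, *b))
                                return false;

        return true;
}
```

Because excluded subdirectories are never made read-only in the first place, nothing has to be flipped back afterwards, which is the property the commit message emphasizes.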
Luca Bruno 48a8d337a6 nspawn: decouple --boot from CLONE_NEWIPC (#4180)
This commit is a minor tweak after the split of `--share-system`, decoupling the `--boot`
option from IPC namespacing.

Historically there has been a single `--share-system` option for sharing IPC/PID/UTS with the
host, which was incompatible with boot/pid1 mode. After the split, it is now possible to express
the requirements with better granularity.

For reference, this is a followup to #4023 which contains references to previous discussions.
I realized too late that CLONE_NEWIPC is not strictly needed for boot mode.
2016-09-24 08:30:42 -04:00
Michael Pope 21dc02277d nspawn: fix comment typo in setup_timezone example (#4183) 2016-09-20 07:30:48 +02:00
Michael Pope 0b493a0263 nspawn: clarify log warning for /etc/localtime not being a symbolic link (#4163) 2016-09-17 09:59:28 +02:00
Luca Bruno 0c582db0c6 nspawn: split down SYSTEMD_NSPAWN_SHARE_SYSTEM (#4023)
This commit follows further on the deprecation path for --share-system,
by splitting and gating each share-able namespace behind its own
environment flag.
2016-08-26 00:08:26 +02:00
Tejun Heo 5da38d0768 core: use the unified hierarchy for the systemd cgroup controller hierarchy
Currently, systemd uses either the legacy hierarchies or the unified hierarchy.
When the legacy hierarchies are used, systemd uses a named legacy hierarchy
mounted on /sys/fs/cgroup/systemd without any kernel controllers for process
management.  Due to the shortcomings in the legacy hierarchy, this involves a
lot of workarounds and complexities.

Because the unified hierarchy can be mounted and used in parallel to legacy
hierarchies, there's no reason for systemd to use a legacy hierarchy for
management even if the kernel resource controllers need to be mounted on legacy
hierarchies.  It can simply mount the unified hierarchy under
/sys/fs/cgroup/systemd and use it without affecting other legacy hierarchies.
This disables a significant amount of fragile workaround logic and would allow
using features which depend on the unified hierarchy membership, such as the bpf
cgroup v2 membership test.  In time, this would also allow deleting the said
complexities.

This patch updates systemd so that it prefers the unified hierarchy for the
systemd cgroup controller hierarchy when legacy hierarchies are used for kernel
resource controllers.

* cg_unified(@controller) is introduced, which tests whether the specific
  controller is on the unified hierarchy and is used to choose the unified hierarchy
  code path for process and service management when available.  Kernel
  controller specific operations remain gated by cg_all_unified().

* "systemd.legacy_systemd_cgroup_controller" kernel argument can be used to
  force the use of legacy hierarchy for systemd cgroup controller.

* nspawn: By default nspawn uses the same hierarchies as the host.  If
  UNIFIED_CGROUP_HIERARCHY is set to 1, unified hierarchy is used for all.  If
  0, legacy for all.

* nspawn: arg_unified_cgroup_hierarchy is made an enum and now encodes one of
  three options - legacy, only systemd controller on unified, and unified.  The
  value is passed into mount setup functions and controls cgroup configuration.

* nspawn: Interpretation of SYSTEMD_CGROUP_CONTROLLER to the actual mount
  option is moved to mount_legacy_cgroup_hierarchy() so that it can take an
  appropriate action depending on the configuration of the host.

v2: - CGroupUnified enum replaces open coded integer values to indicate the
      cgroup operation mode.
    - Various style updates.

v3: Fixed a bug in detect_unified_cgroup_hierarchy() introduced during v2.

v4: Restored legacy container on unified host support and fixed another bug in
    detect_unified_cgroup_hierarchy().
2016-08-17 17:44:36 -04:00
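
The three-state mode and the UNIFIED_CGROUP_HIERARCHY override described in the commit above can be sketched as a small enum plus a selection function. All names below are hypothetical stand-ins rather than the identifiers used in the patch, and accepting "yes"/"no" in addition to "1"/"0" is an assumption based on the variable being used with "no" elsewhere in this log.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the three options named in the commit message. */
typedef enum {
        CGROUP_MODE_LEGACY,            /* legacy hierarchies for everything */
        CGROUP_MODE_SYSTEMD_UNIFIED,   /* only the systemd controller on unified */
        CGROUP_MODE_ALL_UNIFIED,       /* unified hierarchy for everything */
} CGroupMode;

/* Gate for kernel-controller-specific operations. */
static bool mode_all_unified(CGroupMode m) {
        return m == CGROUP_MODE_ALL_UNIFIED;
}

/* Gate for process and service management code paths. */
static bool mode_systemd_unified(CGroupMode m) {
        return m != CGROUP_MODE_LEGACY;
}

/* Pick the container's mode: follow the host unless the environment
 * variable value `env` (NULL if unset) forces one extreme.
 * Returns 0 on success, -EINVAL for an unrecognized value. */
static int pick_container_mode(const char *env, CGroupMode host, CGroupMode *ret) {
        assert(ret);

        if (!env)
                *ret = host;
        else if (strcmp(env, "1") == 0 || strcmp(env, "yes") == 0)
                *ret = CGROUP_MODE_ALL_UNIFIED;
        else if (strcmp(env, "0") == 0 || strcmp(env, "no") == 0)
                *ret = CGROUP_MODE_LEGACY;
        else
                return -EINVAL;

        return 0;
}
```

The two predicates mirror the split between "systemd controller is on unified" and "everything is on unified" that the commit introduces; the sketch covers only the default rule stated in this commit message.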
Tejun Heo ca2f6384aa core: rename cg_unified() to cg_all_unified()
A following patch will update cgroup handling so that the systemd controller
(/sys/fs/cgroup/systemd) can use the unified hierarchy even if the kernel
resource controllers are on the legacy hierarchies.  This would require
distinguishing whether all controllers are on cgroup v2 or only the systemd
controller is.  In preparation, this patch renames cg_unified() to
cg_all_unified().

This patch doesn't cause any functional changes.
2016-08-15 18:13:36 -04:00