Commit graph

488 commits

Author SHA1 Message Date
Zbigniew Jędrzejewski-Szmek 27e29a1e43 nspawn: fix spurious reboot if container process returns 133 2016-10-08 14:48:41 -04:00
Zbigniew Jędrzejewski-Szmek b006762524 nspawn: move the main loop body out to a new function
The new function has 416 lines by itself!

"return log_error_errno" is used to nicely reduce the volume of error
handling code.

A few minor issues are fixed on the way:
- positive value was used as error value (EIO), causing systemd-nspawn
  to return success, even though it shouldn't.
- In two places random values were used as error status, when the
  actual value was in an unusual place (etc_password_lock, notify_socket).

Those are the only functional changes.

There is another potential issue, which is marked with a comment, and left
unresolved: the container can also return 133 by itself, causing a spurious
reboot.
2016-10-08 14:48:41 -04:00
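
The first fix in b006762524 concerns the negative-errno convention: callers test for a negative result, so a positive EIO is read as success. A minimal sketch of that pitfall follows; the function names are hypothetical and not taken from nspawn.

```python
import errno


def run_step_buggy() -> int:
    # Bug: a positive errno value is returned, so callers that only
    # check for a negative result treat the failure as success.
    return errno.EIO


def run_step_fixed() -> int:
    # Fix: failures are reported as negative errno values.
    return -errno.EIO


def exit_status(r: int) -> int:
    # Map a negative-errno style result to a process exit status.
    return 1 if r < 0 else 0
```

With this convention, exit_status(run_step_buggy()) yields 0 although the step failed, while exit_status(run_step_fixed()) yields 1.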
Zbigniew Jędrzejewski-Szmek 98afd6af3a nspawn: check env var first, detect second
If we are going to use the env var to override the detection result
anyway, there is no point in doing the detection, especially since
it can fail.
2016-10-08 14:48:41 -04:00
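
The ordering described in 98afd6af3a can be sketched as follows: look at the override first, and only run the detection when no override is present. The helper names, the accepted boolean spellings and the variable name used in the usage note are assumptions made for illustration.

```python
from typing import Callable, Mapping, Optional


def parse_bool(value: str) -> Optional[bool]:
    # Accept a few common boolean spellings; return None if unrecognised.
    v = value.strip().lower()
    if v in ("1", "yes", "true", "on"):
        return True
    if v in ("0", "no", "false", "off"):
        return False
    return None


def resolve_setting(env: Mapping[str, str], name: str,
                    detect: Callable[[], bool]) -> bool:
    raw = env.get(name)
    if raw is not None:
        parsed = parse_bool(raw)
        if parsed is None:
            raise ValueError(f"cannot parse {name}={raw!r} as a boolean")
        # An override is present: the detection is never run, so it
        # cannot fail either.
        return parsed
    # No override: fall back to the detection, which may raise.
    return detect()
```

For example, resolve_setting(os.environ, "MY_OVERRIDE", detect_something) would skip detect_something entirely whenever MY_OVERRIDE is set (both names are placeholders).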
Lennart Poettering 7429b2eb83 tree-wide: drop some misleading compiler warnings
gcc at some optimization levels thinks these variables were used without
initialization. It's wrong, but let's make the message go away anyway.
2016-10-06 19:04:10 +02:00
Djalal Harouni 41eb436265 nspawn: add log message to let users know that nspawn needs an empty /dev directory (#4226)
Fixes https://github.com/systemd/systemd/issues/3695

At the same time it adds a protection against userns chown of inodes of
a shared mount point.
2016-10-05 06:57:02 +02:00
Alban Crequy 19caffac75 nspawn: set shared propagation mode for the container 2016-10-03 14:19:27 +02:00
Evgeny Vereshchagin cc238590e4 Merge pull request #4185 from endocode/djalal-sandbox-first-protection-v1
core:sandbox: Add new ProtectKernelTunables=, ProtectControlGroups=, ProtectSystem=strict and fixes
2016-09-28 04:50:30 +03:00
Torstein Husebø d23a0044a3 treewide: fix typos (#4217) 2016-09-26 11:32:47 +02:00
Lennart Poettering 6b7c9f8bce namespace: rework how ReadWritePaths= is applied
Previously, if ReadWritePaths= was nested inside a ReadOnlyPaths=
specification, then we'd first recursively apply the ReadOnlyPaths= paths, and
make everything below read-only, only in order to then flip the read-only bit
again for the subdirs listed in ReadWritePaths= below it.

This is not only ugly (as for the dirs in question we first turn on the RO bit,
only to turn it off again immediately after), but also problematic in
containers, where a container manager might have marked a set of dirs read-only
and this code will undo this if ReadWritePaths= is set for any.

With this patch behaviour in this regard is altered: ReadOnlyPaths= will not be
applied to the children listed in ReadWritePaths= in the first place, so that
we do not need to turn off the RO bit for those after all.

This means that ReadWritePaths=/ReadOnlyPaths= may only be used to turn on the
RO bit, but never to turn it off again. Or to say this differently: if some
dirs are marked read-only via some external tool, then ReadWritePaths= will not
undo it.

This is not only the safer option, but also more in-line with what the man page
currently claims:

        "Entries (files or directories) listed in ReadWritePaths= are
        accessible from within the namespace with the same access rights as
        from outside."

To implement this change bind_remount_recursive() gained a new "blacklist"
string list parameter, which when passed may contain subdirs that shall be
excluded from the read-only mounting.

A number of functions are updated to add more debug logging to make this more
digestible.
2016-09-25 10:40:51 +02:00
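
The exclusion rule from 6b7c9f8bce can be sketched as a pure selection step: mounts at or below a blacklisted path are never touched, so their read-only bit stays whatever it was. This is a simplified model that works on a list of mount point strings; the function names are hypothetical and the real bind_remount_recursive() is not reproduced here.

```python
from typing import Iterable, List


def is_same_or_below(path: str, prefix: str) -> bool:
    # True if path equals prefix or lies underneath it.
    prefix = prefix.rstrip("/") or "/"
    if prefix == "/":
        return path.startswith("/")
    return path == prefix or path.startswith(prefix + "/")


def mounts_to_make_read_only(mounts: Iterable[str], ro_root: str,
                             blacklist: Iterable[str]) -> List[str]:
    excluded = list(blacklist)
    selected = []
    for m in mounts:
        if not is_same_or_below(m, ro_root):
            continue  # outside the read-only subtree
        if any(is_same_or_below(m, b) for b in excluded):
            continue  # blacklisted: left alone, RO bit is never flipped
        selected.append(m)
    return selected
```

Because blacklisted mounts are skipped rather than remounted read-write afterwards, a mount that some external tool already marked read-only stays read-only.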
Luca Bruno 48a8d337a6 nspawn: decouple --boot from CLONE_NEWIPC (#4180)
This commit is a minor tweak after the split of `--share-system`, decoupling the `--boot`
option from IPC namespacing.

Historically there has been a single `--share-system` option for sharing IPC/PID/UTS with the
host, which was incompatible with boot/pid1 mode. After the split, it is now possible to express
the requirements with better granularity.

For reference, this is a followup to #4023 which contains references to previous discussions.
I realized too late that CLONE_NEWIPC is not strictly needed for boot mode.
2016-09-24 08:30:42 -04:00
Michael Pope 21dc02277d nspawn: fix comment typo in setup_timezone example (#4183) 2016-09-20 07:30:48 +02:00
Michael Pope 0b493a0263 nspawn: clarify log warning for /etc/localtime not being a symbolic link (#4163) 2016-09-17 09:59:28 +02:00
Luca Bruno 0c582db0c6 nspawn: split down SYSTEMD_NSPAWN_SHARE_SYSTEM (#4023)
This commit follows further on the deprecation path for --share-system,
by splitting and gating each share-able namespace behind its own
environment flag.
2016-08-26 00:08:26 +02:00
Tejun Heo 5da38d0768 core: use the unified hierarchy for the systemd cgroup controller hierarchy
Currently, systemd uses either the legacy hierarchies or the unified hierarchy.
When the legacy hierarchies are used, systemd uses a named legacy hierarchy
mounted on /sys/fs/cgroup/systemd without any kernel controllers for process
management.  Due to the shortcomings in the legacy hierarchy, this involves a
lot of workarounds and complexities.

Because the unified hierarchy can be mounted and used in parallel to legacy
hierarchies, there's no reason for systemd to use a legacy hierarchy for
management even if the kernel resource controllers need to be mounted on legacy
hierarchies.  It can simply mount the unified hierarchy under
/sys/fs/cgroup/systemd and use it without affecting other legacy hierarchies.
This disables a significant amount of fragile workaround logic and would allow
using features which depend on the unified hierarchy membership, such as the bpf
cgroup v2 membership test.  In time, this would also allow deleting the said
complexities.

This patch updates systemd so that it prefers the unified hierarchy for the
systemd cgroup controller hierarchy when legacy hierarchies are used for kernel
resource controllers.

* cg_unified(@controller) is introduced, which tests whether the specific
  controller is on the unified hierarchy and is used to choose the unified hierarchy
  code path for process and service management when available.  Kernel
  controller specific operations remain gated by cg_all_unified().

* "systemd.legacy_systemd_cgroup_controller" kernel argument can be used to
  force the use of legacy hierarchy for systemd cgroup controller.

* nspawn: By default nspawn uses the same hierarchies as the host.  If
  UNIFIED_CGROUP_HIERARCHY is set to 1, unified hierarchy is used for all.  If
  0, legacy for all.

* nspawn: arg_unified_cgroup_hierarchy is made an enum and now encodes one of
  three options - legacy, only systemd controller on unified, and unified.  The
  value is passed into mount setup functions and controls cgroup configuration.

* nspawn: Interpretation of SYSTEMD_CGROUP_CONTROLLER to the actual mount
  option is moved to mount_legacy_cgroup_hierarchy() so that it can take an
  appropriate action depending on the configuration of the host.

v2: - CGroupUnified enum replaces open coded integer values to indicate the
      cgroup operation mode.
    - Various style updates.

v3: Fixed a bug in detect_unified_cgroup_hierarchy() introduced during v2.

v4: Restored legacy container on unified host support and fixed another bug in
    detect_unified_cgroup_hierarchy().
2016-08-17 17:44:36 -04:00
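
The three modes and the two predicates described in 5da38d0768 can be modelled roughly as below. The enum member names, the "name=systemd" controller string and the helper nspawn_mode() are assumptions for illustration; only the 0/1 meaning of UNIFIED_CGROUP_HIERARCHY and the host default come from the commit message.

```python
from enum import IntEnum
from typing import Optional


class CGroupUnified(IntEnum):
    NONE = 0     # legacy hierarchies for everything
    SYSTEMD = 1  # only the systemd controller is on the unified hierarchy
    ALL = 2      # unified hierarchy for everything


SYSTEMD_CONTROLLER = "name=systemd"  # assumed spelling of the systemd controller


def cg_all_unified(mode: CGroupUnified) -> bool:
    # Gate for kernel controller specific operations.
    return mode == CGroupUnified.ALL


def cg_unified(mode: CGroupUnified, controller: str) -> bool:
    # Is this particular controller on the unified hierarchy?
    if mode == CGroupUnified.ALL:
        return True
    if mode == CGroupUnified.SYSTEMD:
        return controller == SYSTEMD_CONTROLLER
    return False


def nspawn_mode(host_mode: CGroupUnified,
                env_value: Optional[str]) -> CGroupUnified:
    # By default the container uses the same setup as the host;
    # UNIFIED_CGROUP_HIERARCHY=1 or =0 forces unified or legacy for all.
    if env_value is None:
        return host_mode
    if env_value == "1":
        return CGroupUnified.ALL
    if env_value == "0":
        return CGroupUnified.NONE
    raise ValueError(f"unexpected UNIFIED_CGROUP_HIERARCHY value: {env_value!r}")
```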
Tejun Heo ca2f6384aa core: rename cg_unified() to cg_all_unified()
A following patch will update cgroup handling so that the systemd controller
(/sys/fs/cgroup/systemd) can use the unified hierarchy even if the kernel
resource controllers are on the legacy hierarchies.  This would require
distinguishing whether all controllers are on cgroup v2 or only the systemd
controller is.  In preparation, this patch renames cg_unified() to
cg_all_unified().

This patch doesn't cause any functional changes.
2016-08-15 18:13:36 -04:00
Lennart Poettering 07a1734a13 Merge pull request #3885 from keszybz/help-output
Update help for "short-full" and shorten to 80 columns
2016-08-04 16:11:38 +02:00
Zbigniew Jędrzejewski-Szmek 90b4a64d77 nspawn,resolve: shorten --help output to fit within 80 columns
make dist-check-help FTW!
2016-08-04 09:03:42 -04:00
Lennart Poettering f7b7b3df9e nspawn: if we can't mark the boot ID RO let's fail
It's probably better to be safe here.
2016-08-03 14:52:16 +02:00
Lennart Poettering a6b5216c7c nspawn: deprecate --share-system support
This removes the --share-system switch: from the documentation, the --help text
as well as the command line parsing. It's an ugly option, given that it kinda
contradicts the whole concept of PID namespaces that nspawn implements. Since
it's barely ever used, let's just deprecate it and remove it from the options.

It might be useful as a debugging option, hence the functionality is kept
around for now, exposed via an undocumented $SYSTEMD_NSPAWN_SHARE_SYSTEM
environment variable.
2016-08-03 14:52:16 +02:00
Lennart Poettering 3539724c26 nspawn: try to bind mount resolved's resolv.conf snippet into the container
This has the benefit that the container can follow the host's DNS server
changes without us having to constantly update the container's resolv.conf
settings.
2016-08-03 14:52:16 +02:00
Christian Brauner 5a8ff0e61d nspawn: add SYSTEMD_NSPAWN_USE_CGNS env variable (#3809)
SYSTEMD_NSPAWN_USE_CGNS allows disabling the use of cgroup namespaces.
2016-07-26 16:49:15 +02:00
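
A rough sketch of how such a switch might be evaluated for 5a8ff0e61d: cgroup namespaces are used when available unless the variable turns them off. Probing /proc/self/ns/cgroup for support and the accepted "off" spellings are assumptions, not details stated in the commit.

```python
import os
from typing import Mapping


def cgns_supported(ns_path: str = "/proc/self/ns/cgroup") -> bool:
    # Assumption: the kernel exposes this file when cgroup namespaces exist.
    return os.path.exists(ns_path)


def use_cgroup_namespace(env: Mapping[str, str], supported: bool) -> bool:
    if not supported:
        return False  # nothing to disable
    raw = env.get("SYSTEMD_NSPAWN_USE_CGNS")
    if raw is None:
        return True   # default: use cgroup namespaces when available
    return raw.strip().lower() not in ("0", "no", "false", "off")
```

A caller could combine the two as use_cgroup_namespace(os.environ, cgns_supported()).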
Zbigniew Jędrzejewski-Szmek e28973ee18 Merge pull request #3757 from poettering/efi-search 2016-07-25 16:34:18 -04:00
Lennart Poettering 1a0b98c437 Merge pull request #3589 from brauner/cgroup_namespace
Cgroup namespace
2016-07-25 22:23:00 +02:00
Zbigniew Jędrzejewski-Szmek 476b8254d9 nspawn: don't skip cleanup on locking error 2016-07-22 21:25:09 -04:00
Lennart Poettering 15b1248a6b machine-id-setup: port machine_id_commit() to new id128-util.c APIs 2016-07-22 12:59:36 +02:00
Lennart Poettering 317feb4d9f nspawn: rework /etc/machine-id handling
With this change we'll no longer write to /etc/machine-id from nspawn, as that
breaks the --volatile= operation, as it ensures the image is never considered
in "first boot", since that's bound to the pre-existence of /etc/machine-id.

The new logic works like this:

- If /etc/machine-id already exists in the container, it is read by nspawn and
  exposed in "machinectl status" and friends.

- If the file doesn't exist yet, but --uuid= is passed on the nspawn cmdline,
  this UUID is passed in $container_uuid to PID 1, and PID 1 is then expected
  to persist this to /etc/machine-id for future boots (which systemd already
  does).

- If the file doesn't exist yet, and no --uuid= is passed, a random UUID is
  generated and passed via $container_uuid.

The result is that /etc/machine-id is never initialized by nspawn itself, thus
unbreaking the volatile mode. Still, the machine ID configured in the
machine always matches nspawn's and thus machined's idea of it.

Fixes: #3611
2016-07-22 12:59:36 +02:00
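
The three cases listed in 317feb4d9f can be sketched as a small decision function. The function name and the returned source labels are hypothetical; the sketch only models which ID is chosen, not how nspawn reads the file or sets the environment.

```python
import uuid
from typing import Callable, Optional, Tuple


def choose_machine_id(existing: Optional[str], uuid_arg: Optional[str],
                      generate: Callable[[], str] = lambda: uuid.uuid4().hex
                      ) -> Tuple[str, str]:
    # existing: contents of the container's /etc/machine-id, or None if absent
    # uuid_arg: value given with --uuid=, or None if not passed
    if existing is not None:
        # The file wins; nspawn only reads it.
        return existing.strip(), "file"
    if uuid_arg is not None:
        # Handed to PID 1 via $container_uuid, which persists it.
        return uuid_arg, "uuid-arg"
    # Nothing given: a random ID is generated and handed over the same way.
    return generate(), "random"
```

In none of the three branches does the sketch write to /etc/machine-id, which mirrors the point of the change.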
Lennart Poettering 691675ba9f nspawn: rework machine/boot ID handling code to use new calls from id128-util.[ch] 2016-07-22 12:59:36 +02:00
Lennart Poettering 910fd145f4 sd-id128: split UUID file read/write code into new id128-util.[ch]
We currently have code to read and write files containing UUIDs at various
places. Unify this in id128-util.[ch], and move some other stuff there too.

The new files are located in src/libsystemd/sd-id128/ (instead of src/shared/),
because they are actually the backend of sd_id128_get_machine() and
sd_id128_get_boot().

In follow-up patches we can use this to reduce the code in nspawn and
machine-id-setup by adopting the common implementation.
2016-07-22 12:59:36 +02:00
Lennart Poettering 3bbaff3e08 tree-wide: use sd_id128_is_null() instead of sd_id128_equal where appropriate
It's a bit easier to read because it's shorter. Also, most likely a tiny bit faster.
2016-07-22 12:38:08 +02:00
Lennart Poettering a6bc7db980 nspawn: if an ESP is part of the disk image to operate on, mount it to /efi or /boot
Matching the behaviour of gpt-auto-generator, if we find an ESP while
dissecting a container image, mount it to /efi or /boot if those dirs exist and
are empty.

This should enable us to run "bootctl" inside a container and do the right
thing.
2016-07-21 11:10:35 +02:00
Lennart Poettering 1ddc1272e7 nspawn: when netns is on, mount /proc/sys/net writable
Normally we make all of /proc/sys read-only in a container, but if we do have
netns enabled we can make /proc/sys/net writable, as things are virtualized
then.
2016-07-20 14:53:15 +02:00
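
The rule in 1ddc1272e7 amounts to a conditional exception to the read-only default. A toy model, with a hypothetical function name and a plain list standing in for the actual mount calls:

```python
from typing import List, Tuple


def proc_sys_mount_plan(private_network: bool) -> List[Tuple[str, str]]:
    # /proc/sys is always read-only inside the container.
    plan = [("/proc/sys", "ro")]
    if private_network:
        # With a network namespace, /proc/sys/net is virtualized and
        # can safely be writable.
        plan.append(("/proc/sys/net", "rw"))
    return plan
```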
Lennart Poettering 065d31c360 nspawn: document why the uid shift range is the way it is 2016-07-20 14:53:15 +02:00
Thomas Hindoe Paaboel Andersen ba19c6e181 treewide: remove unused variables 2016-07-18 22:32:08 +02:00
Zbigniew Jędrzejewski-Szmek 2ed968802c tree-wide: get rid of selinux_context_t (#3732)
9eb9c93275 deprecated selinux_context_t. Replace with a simple char* everywhere.

Alternative fix for #3719.
2016-07-15 18:44:02 +02:00
Michael Biebl 595bfe7df2 Various fixes for typos found by lintian (#3705) 2016-07-12 12:52:11 +02:00
Christian Brauner 0996ef00fb nspawn: handle cgroup namespaces
(NOTE: Cgroup namespaces work with legacy and unified hierarchies: "This is
completely backward compatible and will be completely invisible to any existing
cgroup users (except for those running inside a cgroup namespace and looking at
/proc/pid/cgroup of tasks outside their namespace.)"
(https://lists.linuxfoundation.org/pipermail/containers/2016-January/036582.html)
So there is no need to special case unified.)

If cgroup namespaces are supported we skip mount_cgroups() in the
outer_child(). Instead, we unshare(CLONE_NEWCGROUP) in the inner_child() and
only then do we call mount_cgroups().
The clean way to handle cgroup namespaces would be to delegate mounting of
cgroups completely to the init system in the container. However, this would
likely break backward compatibility with the UNIFIED_CGROUP_HIERARCHY flag of
systemd-nspawn. Also no cgroupfs would be mounted whenever the user simply
requests a shell and no init is available to mount cgroups. Hence, we introduce
mount_legacy_cgns_supported(). After calling unshare(CLONE_NEWCGROUP) it parses
/proc/self/cgroup to find the mounted controllers and mounts them inside the
new cgroup namespace. This should preserve backward compatibility with the
UNIFIED_CGROUP_HIERARCHY flag and mount a cgroupfs when no init in the
container is running.
2016-07-09 06:34:11 +02:00
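
The parsing step mentioned in 0996ef00fb (reading /proc/self/cgroup to find the mounted controllers) can be sketched as follows. Each line of that file has the form "hierarchy-ID:controller-list:path". The function name is hypothetical, and whether named hierarchies such as name=systemd are handled in the same pass in nspawn is not stated in the commit; the sketch simply returns them too.

```python
from typing import List


def legacy_controllers(proc_cgroup_text: str) -> List[str]:
    # Return one entry per legacy hierarchy, e.g. "cpu,cpuacct" for
    # controllers that are mounted together.
    controllers = []
    for line in proc_cgroup_text.splitlines():
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue  # malformed line
        _hierarchy_id, controller_list, _path = parts
        if not controller_list:
            continue  # "0::/..." is the unified hierarchy entry
        controllers.append(controller_list)
    return controllers
```

A caller would typically pass the contents of /proc/self/cgroup read after unshare(CLONE_NEWCGROUP), so that the paths are already relative to the new namespace.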
Lennart Poettering 50b52222f2 nspawn: order caps to retain alphabetically 2016-06-13 16:25:54 +02:00
Alessandro Puccetti 9c1e04d0fa nspawn: introduce --notify-ready=[no|yes] (#3474)
This patch implements a notification mechanism from the init process
in the container to systemd-nspawn.
The switch --notify-ready=yes configures systemd-nspawn to wait for the "READY=1"
message from the init process in the container before sending its own to systemd.
--notify-ready=no is equivalent to the behavior before this patch:
systemd-nspawn notifies systemd with a "READY=1" message when the container is
created. This notification mechanism uses a socket file whose path, relative to the
container, is "/run/systemd/nspawn/notify". The default value is --notify-ready=no.
It is also possible to configure this mechanism from the .nspawn files using
NotifyReady. This parameter takes the same options as the command line switch.

Before this patch, systemd-nspawn notified "ready" after the inner child was created,
regardless of the status of the service running inside it. Now, with --notify-ready=yes,
systemd-nspawn notifies when the service is ready. This is really useful when
there are dependencies between different containers.

Fixes https://github.com/systemd/systemd/issues/1369
Based on the work from https://github.com/systemd/systemd/pull/3022

Testing:
Boot an OS inside a container with systemd-nspawn.
Note: modify the commands according to your filesystem.

1. Create a filesystem where you can boot an OS.
2. sudo systemd-nspawn -D ${HOME}/distros/fedora-23/ sh
2.1. Create the unit file /etc/systemd/system/sleep.service inside the container
     (You can use the example below)
2.2. systemctl enable sleep
2.3. exit
3. sudo systemd-run --service-type=notify --unit=notify-test
   ${HOME}/systemd/systemd-nspawn --notify-ready=yes
   -D ${HOME}/distros/fedora-23/ -b
4. In a different shell run "systemctl status notify-test"

When using --notify-ready=yes the service status is "activating" for 20 seconds
before being set to "active (running)". Instead, using --notify-ready=no
the service status is marked "active (running)" quickly, without waiting for
the 20 seconds.

This patch was also tested with --private-users=yes; you can test it by adding it
at the end of the command in step 3.

------ sleep.service ------
[Unit]
Description=sleep
After=network.target

[Service]
Type=oneshot
ExecStart=/bin/sleep 20

[Install]
WantedBy=multi-user.target
------------ end ------------
2016-06-10 13:09:06 +02:00
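
The readiness message in 9c1e04d0fa travels as a datagram over a Unix socket. The sketch below shows that exchange in isolation, using a temporary socket path instead of /run/systemd/nspawn/notify; all function names are hypothetical and this is not nspawn's implementation.

```python
import os
import socket
import tempfile


def open_notify_socket(path: str) -> socket.socket:
    # Listener side: the party waiting for the readiness message.
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    sock.bind(path)
    sock.settimeout(5)  # avoid blocking forever in the demo
    return sock


def send_notify(path: str, message: str = "READY=1") -> None:
    # Sender side: the init process reporting its state.
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as s:
        s.sendto(message.encode(), path)


def container_is_ready(datagram: bytes) -> bool:
    # A notification may carry several newline-separated assignments.
    return "READY=1" in datagram.decode().splitlines()


def demo() -> bool:
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "notify")
        listener = open_notify_socket(path)
        try:
            send_notify(path)
            return container_is_ready(listener.recv(4096))
        finally:
            listener.close()
```

With --notify-ready=yes, the listener side would forward its own "READY=1" only once container_is_ready() holds; with --notify-ready=no it would not wait for the message at all.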
Michael Karcher 8869a0b40b util-lib: Add sparc64 support for process creation (#3348)
The current raw_clone function takes two arguments, the cloning flags and
a pointer to the stack for the cloned child. The raw cloning without
passing a "thread main" function does not make sense if a new stack is
specified, as it returns in both the parent and the child, which will fail
in the child as the stack is virgin. All uses of raw_clone indeed pass NULL
for the stack pointer which indicates that both processes should share the
stack address (so you had better not pass CLONE_VM).

This commit refactors the code to not require the caller to pass the stack
address, as NULL is the only sensible option. It also adds the magic code
needed to make raw_clone work on sparc64, which does not return 0 in %o0
for the child, but indicates the child process by setting %o1 to non-zero.
This refactoring is not purely aesthetic, because non-NULL stack addresses
need to get mangled before being passed to the clone syscall (you have to
apply STACK_BIAS), whereas NULL must not be mangled. Implementing the
conditional mangling of the stack address would needlessly complicate the
code.

raw_clone is moved to a separate header, because the burden of including
the assert machinery and sched.h shouldn't be applied to every user of
missing_syscalls.h.
2016-05-29 20:03:51 -04:00
Djalal Harouni 520e0d541f nspawn: rename arg_retain to arg_caps_retain
The argument is about capabilities.
2016-05-26 22:43:34 +02:00
Djalal Harouni f011b0b87a nspawn: split out seccomp call into nspawn-seccomp.[ch]
Split seccomp into nspawn-seccomp.[ch]. Currently there are no changes,
but this will make it easy in the future to share or use the seccomp logic
from systemd core.
2016-05-26 22:42:29 +02:00
Zbigniew Jędrzejewski-Szmek b5a2179b10 nspawn: remove unreachable return statement (#3320) 2016-05-22 13:02:41 +02:00
Lennart Poettering 2099b3e993 nspawn: drop spurious newline 2016-05-12 20:14:58 +02:00
Lennart Poettering 7513c5b89f nspawn: only remove veth links we created ourselves
Let's make sure we don't remove veth links that existed before nspawn was
invoked.

https://github.com/systemd/systemd/pull/3209#discussion_r62439999
2016-05-09 15:45:31 +02:00
Lennart Poettering 22b28dfdc7 nspawn: add new --network-zone= switch for automatically managed bridge devices
This adds a new concept of network "zones", which are little more than bridge
devices that are automatically managed by nspawn: when the first container
referencing a bridge is started, the bridge device is created, when the last
container referencing it is removed the bridge device is removed again. Besides
this logic --network-zone= is pretty much identical to --network-bridge=.

The use case for this is to make it easy to run multiple related containers
(think MySQL in one and Apache in another) in a common, named virtual Ethernet
broadcast zone that only exists as long as one of them is running, and is fully
automatically managed otherwise.
2016-05-09 15:45:31 +02:00
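
The lifecycle rule for zones in 22b28dfdc7 (create the bridge with the first container, remove it with the last) can be modelled with a simple reference count. This is an in-memory illustration only: the class name and the log strings are hypothetical, and since real containers are separate processes, nspawn cannot rely on a counter like this one.

```python
from typing import Dict, List


class ZoneRegistry:
    def __init__(self) -> None:
        self.members: Dict[str, int] = {}  # zone name -> attached containers
        self.log: List[str] = []           # bridge actions, in order

    def attach(self, zone: str) -> None:
        if zone not in self.members:
            # First container referencing the zone: create the bridge.
            self.log.append(f"create bridge {zone}")
            self.members[zone] = 0
        self.members[zone] += 1

    def detach(self, zone: str) -> None:
        if zone not in self.members:
            raise KeyError(f"zone {zone!r} has no attached containers")
        self.members[zone] -= 1
        if self.members[zone] == 0:
            # Last container gone: remove the bridge again.
            del self.members[zone]
            self.log.append(f"remove bridge {zone}")
```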
Lennart Poettering ef76dff225 util-lib: add new ifname_valid() call that validates interface names
Make use of this in nspawn at a couple of places. A later commit should port
more code over to this, including networkd.
2016-05-09 15:45:31 +02:00
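
The commit ef76dff225 does not list the rules that ifname_valid() applies. The sketch below assumes a plausible set based on common kernel naming constraints (length below IFNAMSIZ, printable ASCII without whitespace, no ':' or '/', not "." or "..", not purely numeric); treat every rule here as an assumption rather than the actual implementation.

```python
IFNAMSIZ = 16  # kernel interface name buffer size, including the trailing NUL


def ifname_valid(name: str) -> bool:
    if not name or len(name.encode()) >= IFNAMSIZ:
        return False  # empty, or too long to fit the buffer
    if name in (".", ".."):
        return False
    for ch in name:
        if ord(ch) <= 32 or ord(ch) >= 127:
            return False  # control chars, whitespace, non-ASCII
        if ch in ":/":
            return False  # reserved for aliases and paths
    if name.isdigit():
        return False  # purely numeric names look like interface indexes
    return True
```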
Zbigniew Jędrzejewski-Szmek 5ab1cef0db Merge pull request #3111 from poettering/nspawn-remove-veth 2016-05-03 13:53:00 -04:00
Zbigniew Jędrzejewski-Szmek c29f959b44 Revert "nspawn: explicitly remove veth links after use (#3111)"
This reverts commit d2773e59de.

Merge got squashed by mistake.
2016-05-03 13:53:00 -04:00
Evgeny Vereshchagin e192a2815e nspawn: convert uuid to string (#3146)
Fixes:
cp /etc/machine-id /var/tmp/systemd-test.HccKPa/nspawn-root/etc
systemd-nspawn -D /var/tmp/systemd-test.HccKPa/nspawn-root --link-journal host -b
...
Host and machine ids are equal (P�S!V): refusing to link journals
2016-04-29 10:38:35 +02:00
Evgeny Vereshchagin 5aa3eba50c nspawn: initialize the veth_name (#3141)
Fixes:
$ systemd-nspawn -h
...
Failed to remove veth interface ����: Operation not permitted

This is a follow-up for d2773e59de
2016-04-28 19:48:17 +02:00