(NOTE: Cgroup namespaces work with legacy and unified hierarchies: "This is
completely backward compatible and will be completely invisible to any existing
cgroup users (except for those running inside a cgroup namespace and looking at
/proc/pid/cgroup of tasks outside their namespace.)"
(https://lists.linuxfoundation.org/pipermail/containers/2016-January/036582.html)
So there is no need to special case unified.)
If cgroup namespaces are supported we skip mount_cgroups() in the
outer_child(). Instead, we unshare(CLONE_NEWCGROUP) in the inner_child() and
only then do we call mount_cgroups().
The clean way to handle cgroup namespaces would be to delegate mounting of
cgroups completely to the init system in the container. However, this would
likely break backward compatibility with the UNIFIED_CGROUP_HIERARCHY flag of
systemd-nspawn. Also no cgroupfs would be mounted whenever the user simply
requests a shell and no init is available to mount cgroups. Hence, we introduce
mount_legacy_cgns_supported(). After calling unshare(CLONE_NEWCGROUP) it parses
/proc/self/cgroup to find the mounted controllers and mounts them inside the
new cgroup namespace. This should preserve backward compatibility with the
UNIFIED_CGROUP_HIERARCHY flag and mount a cgroupfs when no init in the
container is running.
This the patch implements a notificaiton mechanism from the init process
in the container to systemd-nspawn.
The switch --notify-ready=yes configures systemd-nspawn to wait the "READY=1"
message from the init process in the container to send its own to systemd.
--notify-ready=no is equivalent to the previous behavior before this patch,
systemd-nspawn notifies systemd with a "READY=1" message when the container is
created. This notificaiton mechanism uses socket file with path relative to the contanier
"/run/systemd/nspawn/notify". The default values it --notify-ready=no.
It is also possible to configure this mechanism from the .nspawn files using
NotifyReady. This parameter takes the same options of the command line switch.
Before this patch, systemd-nspawn notifies "ready" after the inner child was created,
regardless the status of the service running inside it. Now, with --notify-ready=yes,
systemd-nspawn notifies when the service is ready. This is really useful when
there are dependencies between different contaniers.
Fixes https://github.com/systemd/systemd/issues/1369
Based on the work from https://github.com/systemd/systemd/pull/3022
Testing:
Boot a OS inside a container with systemd-nspawn.
Note: modify the commands accordingly with your filesystem.
1. Create a filesystem where you can boot an OS.
2. sudo systemd-nspawn -D ${HOME}/distros/fedora-23/ sh
2.1. Create the unit file /etc/systemd/system/sleep.service inside the container
(You can use the example below)
2.2. systemdctl enable sleep
2.3 exit
3. sudo systemd-run --service-type=notify --unit=notify-test
${HOME}/systemd/systemd-nspawn --notify-ready=yes
-D ${HOME}/distros/fedora-23/ -b
4. In a different shell run "systemctl status notify-test"
When using --notify-ready=yes the service status is "activating" for 20 seconds
before being set to "active (running)". Instead, using --notify-ready=no
the service status is marked "active (running)" quickly, without waiting for
the 20 seconds.
This patch was also test with --private-users=yes, you can test it just adding it
at the end of the command at point 3.
------ sleep.service ------
[Unit]
Description=sleep
After=network.target
[Service]
Type=oneshot
ExecStart=/bin/sleep 20
[Install]
WantedBy=multi-user.target
------------ end ------------
The current raw_clone function takes two arguments, the cloning flags and
a pointer to the stack for the cloned child. The raw cloning without
passing a "thread main" function does not make sense if a new stack is
specified, as it returns in both the parent and the child, which will fail
in the child as the stack is virgin. All uses of raw_clone indeed pass NULL
for the stack pointer which indicates that both processes should share the
stack address (so you better don't pass CLONE_VM).
This commit refactors the code to not require the caller to pass the stack
address, as NULL is the only sensible option. It also adds the magic code
needed to make raw_clone work on sparc64, which does not return 0 in %o0
for the child, but indicates the child process by setting %o1 to non-zero.
This refactoring is not plain aesthetic, because non-NULL stack addresses
need to get mangled before being passed to the clone syscall (you have to
apply STACK_BIAS), whereas NULL must not be mangled. Implementing the
conditional mangling of the stack address would needlessly complicate the
code.
raw_clone is moved to a separete header, because the burden of including
the assert machinery and sched.h shouldn't be applied to every user of
missing_syscalls.h
Split seccomp into nspawn-seccomp.[ch]. Currently there are no changes,
but this will make it easy in the future to share or use the seccomp logic
from systemd core.
This adds a new concept of network "zones", which are little more than bridge
devices that are automatically managed by nspawn: when the first container
referencing a bridge is started, the bridge device is created, when the last
container referencing it is removed the bridge device is removed again. Besides
this logic --network-zone= is pretty much identical to --network-bridge=.
The usecase for this is to make it easy to run multiple related containers
(think MySQL in one and Apache in another) in a common, named virtual Ethernet
broadcast zone, that only exists as long as one of them is running, and fully
automatically managed otherwise.
* sd-netlink: permit RTM_DELLINK messages with no ifindex
This is useful for removing network interfaces by name.
* nspawn: explicitly remove veth links we created after use
Sometimes the kernel keeps veth links pinned after the namespace they have been
joined to died. Let's hence explicitly remove veth links after use.
Fixes: #2173
Sometimes the kernel keeps veth links pinned after the namespace they have been
joined to died. Let's hence explicitly remove veth links after use.
Fixes: #2173
In order to implement this we change the bool arg_userns into an enum
UserNamespaceMode, which can take one of NO, PICK or FIXED, and replace the
arg_uid_range_pick bool with it.
This adds the new value "pick" to --private-users=. When specified a new
UID/GID range of 65536 users is automatically and randomly allocated from the
host range 0x00080000-0xDFFF0000 and used for the container. The setting
implies --private-users-chown, so that container directory is recursively
chown()ed to the newly allocated UID/GID range, if that's necessary. As an
optimization before picking a randomized UID/GID the UID of the container's
root directory is used as starting point and used if currently not used
otherwise.
To protect against using the same UID/GID range multiple times a few mechanisms
are in place:
- The first and the last UID and GID of the range are checked with getpwuid()
and getgrgid(). If an entry already exists a different range is picked. Note
that by "last" UID the user 65534 is used, as 65535 is the 16bit (uid_t) -1.
- A lock file for the range is taken in /run/systemd/nspawn-uid/. Since the
ranges are taken in a non-overlapping fashion, and always start on 64K
boundaries this allows us to maintain a single lock file for each range that
can be randomly picked. This protects nspawn from picking the same range in
two parallel instances.
- If possible the /etc/passwd lock file is taken while a new range is selected
until the container is up. This means adduser/addgroup should safely avoid
the range as long as nss-mymachines is used, since the allocated range will
then show up in the user database.
The UID/GID range nspawn picks from is compiled in and not configurable at the
moment. That should probably stay that way, since we already provide ways how
users can pick their own ranges manually if they don't like the automatic
logic.
The new --private-users=pick logic makes user namespacing pretty useful now, as
it relieves the user from managing UID/GID ranges.
This adds a new --private-userns-chown switch that may be used in combination
with --private-userns. If it is passed a recursive chmod() operation is run on
the OS tree, fixing all file owner UID/GIDs to the right ranges. This should
make user namespacing pretty workable, as the OS trees don't need to be
prepared manually anymore.
We check /etc/machine-id of the container and if it is already populated
we use value from there, possibly ignoring value of --uuid option from
the command line. When dealing with R/O image we setup transient machine
id.
Once we determined machine id of the container, we use this value for
registration with systemd-machined and we also export it via
container_uuid environment variable.
As registration with systemd-machined is done by the main nspawn process
we communicate container machine id established by setup_machine_id from
outer child to the main process by unix domain socket. Similarly to PID
of inner child.
We get
$ systemd-nspawn --image /dev/loop1 --port 8080:80 -n -b 3
--port= is not supported, compiled without libiptc support.
instead of a ping-nc-iptables debugging session
If the user specifies an selinux_apifs_context all content created in
the container including /dev/console should use this label.
Currently when this uses the default label it gets labeled user_devpts_t,
which would require us to write a policy allowing container processes to
manage user_devpts_t. This means that an escaped process would be allowed
to attack all users terminals as well as other container terminals. Changing
the label to match the apifs_context, means the processes would only be allowed
to manage their specific tty.
This change fixes a problem preventing RKT containers from working with systemd-nspawn.
Throughout the tree there's spurious use of spaces separating ++ and --
operators from their respective operands. Make ++ and -- operator
consistent with the majority of existing uses; discard the spaces.
ISO/IEC 9899:1999 §7.21.1/2 says:
Where an argument declared as size_t n specifies the length of the array
for a function, n can have the value zero on a call to that
function. Unless explicitly stated otherwise in the description of a
particular function in this subclause, pointer arguments on such a call
shall still have valid values, as described in 7.1.4.
In base64_append_width memcpy was called as memcpy(x, NULL, 0). GCC 4.9
started making use of this and assumes This worked fine under -O0, but
does something strange under -O3.
This patch fixes a bug in base64_append_width(), fixes a possible bug in
journal_file_append_entry_internal(), and makes use of the new function
to simplify the code in other places.
This adds a new switch --as-pid2, which allows running commands as PID 2, while a stub init process is run as PID 1.
This is useful in order to run arbitrary commands in a container, as PID1's semantics are different from all other
processes regarding reaping of unknown children or signal handling.
Change the capability bounding set parser and logic so that the bounding
set is kept as a positive set internally. This means that the set
reflects those capabilities that we want to keep instead of drop.
GLIB has recently started to officially support the gcc cleanup
attribute in its public API, hence let's do the same for our APIs.
With this patch we'll define an xyz_unrefp() call for each public
xyz_unref() call, to make it easy to use inside a
__attribute__((cleanup())) expression. Then, all code is ported over to
make use of this.
The new calls are also documented in the man pages, with examples how to
use them (well, I only added docs where the _unref() call itself already
had docs, and the examples, only cover sd_bus_unrefp() and
sd_event_unrefp()).
This also renames sd_lldp_free() to sd_lldp_unref(), since that's how we
tend to call our destructors these days.
Note that this defines no public macro that wraps gcc's attribute and
makes it easier to use. While I think it's our duty in the library to
make our stuff easy to use, I figure it's not our duty to make gcc's own
features easy to use on its own. Most likely, client code which wants to
make use of this should define its own:
#define _cleanup_(function) __attribute__((cleanup(function)))
Or similar, to make the gcc feature easier to use.
Making this logic public has the benefit that we can remove three header
files whose only purpose was to define these functions internally.
See #2008.
The new switch operates like --network-veth, but may be specified
multiple times (to define multiple link pairs) and allows flexible
definition of the interface names.
This is an independent reimplementation of #1678, but defines different
semantics, keeping the behaviour completely independent of
--network-veth. It also comes will full hook-up for .nspawn files, and
the matching documentation.
We were hardcoding "systemd-nspawn" as the value of the $container env
variable and "nspawn" as the service string in machined registration.
This commit allows the user to configure it by setting the
$SYSTEMD_NSPAWN_CONTAINER_SERVICE env variable when calling
systemd-nspawn.
If $SYSTEMD_NSPAWN_CONTAINER_SERVICE is not set, we use the string
"systemd-nspawn" for both, fixing the previous inconsistency.
Use the "return log_error_errno(...)" idiom to have fewer curly braces.
The last hunk also fixes the return value of setup_journal(), but the
fix has no practical effect.
The files are named too generically, so that they might conflict with
the upstream project headers. Hence, let's add a "-util" suffix, to
clarify that this are just our utility headers and not any official
upstream headers.
There are more than enough calls doing string manipulations to deserve
its own files, hence do something about it.
This patch also sorts the #include blocks of all files that needed to be
updated, according to the sorting suggestions from CODING_STYLE. Since
pretty much every file needs our string manipulation functions this
effectively means that most files have sorted #include blocks now.
Also touches a few unrelated include files.
get_current_dir_name() can return a variety of errors, not just ENOMEM,
hence don't blindly turn its errors to ENOMEM, but return correct errors
in path_make_absolute_cwd().
This trickles down into a couple of other functions, some of which
receive unrelated minor fixes too with this commit.
Make sure we acquire CAP_NET_ADMIN if we require virtual networking.
Make sure we imply virtual ethernet correctly when bridge is request.
Fixes: #1511Fixes: #1554Fixes: #1590
With this change we understand more than just leaf quota groups for
btrfs file systems. Specifically:
- When we create a subvolume we can now optionally add the new subvolume
to all qgroups its parent subvolume was member of too. Alternatively
it is also possible to insert an intermediary quota group between the
parent's qgroups and the subvolume's leaf qgroup, which is useful for
a concept of "subtree" qgroups, that contain a subvolume and all its
children.
- The remove logic for subvolumes has been updated to optionally remove
any leaf qgroups or "subtree" qgroups, following the logic above.
- The snapshot logic for subvolumes has been updated to replicate the
original qgroup setup of the source, if it follows the "subtree"
design described above. It will not cover qgroup setups that introduce
arbitrary qgroups, especially those orthogonal to the subvolume
hierarchy.
This also tries to be more graceful when setting up /var/lib/machines as
btrfs. For example, if mkfs.btrfs is missing we don't even try to set it
up as loopback device.
Fixes#1559Fixes#1129
Since v3.11/7dc5dbc ("sysfs: Restrict mounting sysfs"), the kernel
doesn't allow mounting sysfs if you don't have CAP_SYS_ADMIN rights over
the network namespace.
So the mounting /sys as a tmpfs code introduced in
d8fc6a000f doesn't work with user
namespaces if we don't use private-net. The reason is that we mount
sysfs inside the container and we're in the network namespace of the host
but we don't have CAP_SYS_ADMIN over that namespace.
To fix that, we mount /sys as a sysfs (instead of tmpfs) if we don't use
private network and ignore the /sys-as-a-tmpfs code if we find that /sys
is already mounted as sysfs.
Fixes#1555
Previously, we'd allocate the TTY, spawn a service on it, but
immediately start processing the TTY and forwarding it to whatever the
commnd was started on. This is however problematic, as the TTY might get
actually opened only much later by the service. We'll hence first get
EIOs on the master as the other side is still closed, and hence
considered it hung up and terminated the session.
With this change we add a flag to the pty forwarding logic:
PTY_FORWARD_IGNORE_INITIAL_VHANGUP. If set, we'll ignore all hangups
(i.e. EIOs) on the master PTY until the first byte is successfully read.
From that point on we consider a hangup/EIO a regular connection termination. This
way, we handle the race: when we get EIO initially we'll ignore it,
until the connection is properly set up, at which time we start
honouring it.
This way we can hide things like /sys/firmware or /sys/hypervisor from
the container, while keeping the device tree around.
While this is a security benefit in itself it also allows us to fix
issue #1277.
Previously we'd mount /sys before creating the user namespace, in order
to be able to mount /sys/fs/cgroup/* beneath it (which resides in it),
which we can only mount outside of the user namespace. To ensure that
the user namespace owns the network namespace we'd set up the network
namespace at the same time as the user namespace. Thus, we'd still see
the /sys/class/net/ from the originating network namespace, even though
we are in our own network namespace now. With this patch, /sys is
mounted before transitioning into the user namespace as tmpfs, so that
we can also mount /sys/fs/cgroup/* into it this early. The directories
such as /sys/class/ are then later added in from the real sysfs from
inside the network and user namespace so that they actually show whatis
available in it.
Fixes#1277
This also allows us to drop build.h from a ton of files, hence do so.
Since we touched the #includes of those files, let's order them properly
according to CODING_STYLE.
Also, make it slightly more powerful, by accepting a flags argument, and
make it safe for handling if more than one cmsg attribute happens to be
attached.
Introduce two new helpers that send/receive a single fd via a unix
transport. Also make nspawn use them instead of hard-coding it.
Based on a patch by Krzesimir Nowak.
off_t is a really weird type as it is usually 64bit these days (at least
in sane programs), but could theoretically be 32bit. We don't support
off_t as 32bit builds though, but still constantly deal with safely
converting from off_t to other types and back for no point.
Hence, never use the type anymore. Always use uint64_t instead. This has
various benefits, including that we can expose these values directly as
D-Bus properties, and also that the values parse the same in all cases.
SOCK_DGRAM and SOCK_SEQPACKET have very similar semantics when used with
socketpair(). However, SOCK_SEQPACKET has the advantage of knowing a
hangup concept, since it is inherently connection-oriented.
Since we use socket pairs to communicate between the nspawn main process
and the nspawn child process, where the child might die abnormally it's
interesting to us to learn about this via hangups if the child side of
the pair is closed. Hence, let's switch to SOCK_SEQPACKET for these
internal communication sockets.
Fixes#956.
.nspawn fiels are simple settings files that may accompany container
images and directories and contain settings otherwise passed on the
nspawn command line. This provides an efficient way to attach execution
data directly to containers.
In the unified hierarchy delegating controller access is safe, hence
make sure to enable all controllers for the "payload" subcgroup if we
create it, so that the container will have all controllers enabled the
nspawn service itself has.
This patch set adds full support the new unified cgroup hierarchy logic
of modern kernels.
A new kernel command line option "systemd.unified_cgroup_hierarchy=1" is
added. If specified the unified hierarchy is mounted to /sys/fs/cgroup
instead of a tmpfs. No further hierarchies are mounted. The kernel
command line option defaults to off. We can turn it on by default as
soon as the kernel's APIs regarding this are stabilized (but even then
downstream distros might want to turn this off, as this will break any
tools that access cgroupfs directly).
It is possibly to choose for each boot individually whether the unified
or the legacy hierarchy is used. nspawn will by default provide the
legacy hierarchy to containers if the host is using it, and the unified
otherwise. However it is possible to run containers with the unified
hierarchy on a legacy host and vice versa, by setting the
$UNIFIED_CGROUP_HIERARCHY environment variable for nspawn to 1 or 0,
respectively.
The unified hierarchy provides reliable cgroup empty notifications for
the first time, via inotify. To make use of this we maintain one
manager-wide inotify fd, and each cgroup to it.
This patch also removes cg_delete() which is unused now.
On kernel 4.2 only the "memory" controller is compatible with the
unified hierarchy, hence that's the only controller systemd exposes when
booted in unified heirarchy mode.
This introduces a new enum for enumerating supported controllers, plus a
related enum for the mask bits mapping to it. The core is changed to
make use of this everywhere.
This moves PID 1 into a new "init.scope" implicit scope unit in the root
slice. This is necessary since on the unified hierarchy cgroups may
either contain subgroups or processes but not both. PID 1 hence has to
move out of the root cgroup (strictly speaking the root cgroup is the
only one where processes and subgroups are still allowed, but in order
to support containers nicey, we move PID 1 into the new scope in all
cases.) This new unit is also used on legacy hierarchy setups. It's
actually pretty useful on all systems, as it can then be used to filter
journal messages coming from PID 1, and so on.
The root slice ("-.slice") is now implicitly created and started (and
does not require a unit file on disk anymore), since
that's where "init.scope" is located and the slice needs to be started
before the scope can.
To check whether we are in unified or legacy hierarchy mode we use
statfs() on /sys/fs/cgroup. If the .f_type field reports tmpfs we are in
legacy mode, if it reports cgroupfs we are in unified mode.
This patch set carefuly makes sure that cgls and cgtop continue to work
as desired.
When invoking nspawn as a service it will implicitly create two
subcgroups in the cgroup it is using, one to move the nspawn process
into, the other to move the actual container processes into. This is
done because of the requirement that cgroups may either contain
processes or other subgroups.
--bind and --bind-ro perform the bind mount
non-recursively. It is sometimes (often?) desirable
to do a recursive mount. This patch adds an optional
set of bind mount options in the form of:
--bind=src-path:dst-path:options
options are comma separated and currently only
"rbind" and "norbind" are allowed.
Default value is "rbind".
Overlayfs uses , as an option separator and : as a list separator. These
characters are both valid in file paths, so overlayfs allows file paths
which contain these characters to backslash escape these values.
This now accepts : characters with the \: escape sequence.
Other escape sequences are also interpreted, but having a \ in your file
path is less likely than :, so this shouldn't break anyone's existing
tools.
Mounting devpts with a uid breaks pty allocation with recent glibc
versions, which expect that the kernel will set the correct owner for
user-allocated ptys.
The kernel seems to be smart enough to use the correct uid for root when
we switch to a user namespace.
This resolves#337.
The latest consolidation cleanup of write_string_file() revealed some users
of that helper which should have used write_string_file_no_create() in the
past but didn't. Basically, all existing users that write to files in /sys
and /proc should not expect to write to a file which is not yet existant.
Merge write_string_file(), write_string_file_no_create() and
write_string_file_atomic() into write_string_file() and provide a flags mask
that allows combinations of atomic writing, newline appending and automatic
file creation. Change all users accordingly.
There is logic to determine the UID shift from the file-system, rather
than having it be explicitly passed in.
However, this needs to happen in the child process that sets up the
mounts, as what's important is the UID of the mounted root, rather than
the mount-point.
Setting up the UID map needs to happen in the parent becuase the inner
child needs to have been started, and the outer child is no longer able
to access the uid_map file, since it lost access to it when setting up
the mounts for the inner child.
So we need to communicate the uid shift back out, along with the PID of
the inner child process.
Failing to communicate this means that the invalid UID shift, which is
the value used to specify "this needs to be determined from the file
system" is left invalid, so setting up the user namespace's UID shift
fails.
sd_bus_flush_close_unref() is a call that simply combines sd_bus_flush()
(which writes all unwritten messages out) + sd_bus_close() (which
terminates the connection, releasing all unread messages) +
sd_bus_unref() (which frees the connection).
The combination of this call is used pretty frequently in systemd tools
right before exiting, and should also be relevant for most external
clients, and is hence useful to cover in a call of its own.
Previously the combination of the three calls was already done in the
_cleanup_bus_close_unref_ macro, but this was only available internally.
Also see #327
It is needed in one branch of the fork, but calculated in another
branch.
Failing to do this means using --private-users without specifying a uid
shift always fails because it tries to shift the uid to UID_INVALID.
When we do a MS_BIND mount, it inherits the flags of its parent mount.
When we do a remount, it sets the flags to exactly what is specified.
If we are in a user namespace then these mount points have their flags
locked, so you can't reduce the protection.
As a consequence, the default setup of mount_all doesn't work with user
namespaces. However if we ensure we add the mount flags of the parent
mount when remounting, then we aren't removing mount options, so we
aren't trying to unlock an option that we aren't allowed to.
In such a case let's suppress the warning (downgrade to LOG_DEBUG),
under the assumption that the user has no config file to update in its
place, but a symlink that points to something like resolved's
automatically managed resolve.conf file.
While we are at it, also stop complaining if we cannot write /etc/resolv.conf
due to a read-only disk, given that there's little we could do about it.
If the kernel do not support user namespace then one of the children
created by nspawn parent will fail at clone(CLONE_NEWUSER) with the
generic error EINVAL and without logging the error. At the same time
the parent may also try to setup the user namespace and will fail with
another error.
To improve this, check if the kernel supports user namespace as early
as possible.
This ports a lot of manual code over to sigprocmask_many() and friends.
Also, we now consistly check for sigprocmask() failures with
assert_se(), since the call cannot realistically fail unless there's a
programming error.
Also encloses a few sd_event_add_signal() calls with (void) when we
ignore the return values for it knowingly.
Remove old temporary snapshots, but only at boot. Ideally we'd have
"self-destroying" btrfs snapshots that go away if the last last
reference to it does. To mimic a scheme like this at least remove the
old snapshots on fresh boots, where we know they cannot be referenced
anymore. Note that we actually remove all temporary files in
/var/lib/machines/ at boot, which should be safe since the directory has
defined semantics. In the root directory (where systemd-nspawn
--ephemeral places snapshots) we are more strict, to avoid removing
unrelated temporary files.
This also splits out nspawn/container related tmpfiles bits into a new
tmpfiles snippet to systemd-nspawn.conf
This adds a "char *extra" parameter to tempfn_xxxxxx(), tempfn_random(),
tempfn_ranomd_child(). If non-NULL this string is included in the middle
of the newly created file name. This is useful for being able to
distuingish the kind of temporary file when we see one.
This also adds tests for the three call.
For now, we don't make use of this at all, but port all users over.
seccomp_load returns -EINVAL when seccomp support is not enabled in the
kernel [1]. This should be a debug log, not an error that interrupts nspawn.
If the seccomp filter can't be set and audit is enabled, the user will
get an error message anyway.
[1]: http://man7.org/linux/man-pages/man2/prctl.2.html
Also, when the child is potentially long-running make sure to set a
death signal.
Also, ignore the result of the reset operations explicitly by casting
them to (void).
Rather than checking the return of asprintf() we are checking if buf gets allocated,
make it clear that it is ok to ignore the return value.
Fixes CID 1299644.
Allowed interface name is relatively small. Lets not make
users go in to the source code to figure out what happened.
--machine=debian-tree conflicts with
--machine=debian-tree2
ex: Failed to add new veth \
interfaces (host0, vb-debian-tree): File exists
When systemd-nspawn gets exec*()ed, it inherits the followings file
descriptors:
- 0, 1, 2: stdin, stdout, stderr
- SD_LISTEN_FDS_START, ... SD_LISTEN_FDS_START+LISTEN_FDS: file
descriptors passed by the system manager (useful for socket
activation). They are passed to the child process (process leader).
- extra lock fd: rkt passes a locked directory as an extra fd, so the
directory remains locked as long as the container is alive.
systemd-nspawn used to close all open fds except 0, 1, 2 and the
SD_LISTEN_FDS_START..SD_LISTEN_FDS_START+LISTEN_FDS. This patch delays
the close just before the exec so the nspawn process (parent) keeps the
extra fds open.
This patch supersedes the previous attempt ("cloexec extraneous fds"):
http://lists.freedesktop.org/archives/systemd-devel/2015-May/031608.html
If a symlink to a combined cgroup hierarchy already exists and points to
the right path, skip it. This avoids an error when the cgroups are set
manually before calling nspawn.
Previously all bind mount mounts were applied in the order specified,
followed by all tmpfs mounts in the order specified. This is
problematic, if bind mounts shall be placed within tmpfs mounts.
This patch hence reworks the custom mount point logic, and alwas applies
them in strict prefix-first order. This means the order of mounts
specified on the command line becomes irrelevant, the right operation
will always be executed.
While we are at it this commit also adds native support for overlayfs
mounts, as supported by recent kernels.
Some systems abusively restrict mknod, even when the device node already
exists in /dev. This is unfortunate because it prevents systemd-nspawn
from creating the basic devices in /dev in the container.
This patch implements a workaround: when mknod fails, fallback on bind
mounts.
Additionally, /dev/console was created with a mknod with the same
major/minor as /dev/null before bind mounting a pts on it. This patch
removes the mknod and creates an empty regular file instead.
In order to test this patch, I used the following configuration, which I
think should replicate the system with the abusive restriction on mknod:
# grep devices /proc/self/cgroup
4:devices:/user.slice/restrict
# cat /sys/fs/cgroup/devices/user.slice/restrict/devices.list
c 1:9 r
c 5:2 rw
c 136:* rw
# systemd-nspawn --register=false -D .
v2:
- remove "bind", it is not needed since there is already MS_BIND
v3:
- fix error management when calling touch()
- fix lowercase in error message
We have no such check in any of the other tools, hence don't have one in
nspawn either.
(This should make things nicer for Rocket, among other things)
Note: removing this check does not mean that we support running nspawn
on non-systemd. We explicitly don't. It just means that we remove the
check for running it like that. You are still on your own if you do...
This change makes it so all seccomp filters are mapped
to the appropriate capability and are only added if that
capability was not requested when running the container.
This unbreaks the remaining use cases broken by the
addition of seccomp filters without respecting requested
capabilities.
Co-Authored-By: Clif Houck <me@clifhouck.com>
[zj: - adapt to our coding style, make struct anonymous]
This patch removes includes that are not used. The removals were found with
include-what-you-use which checks if any of the symbols from a header is
in use.
Previously we always invoked the container PID 1 on /dev/console of the
container. With this change we do so only if nspawn was invoked
interactively (i.e. its stdin/stdout was connected to a TTY). In all other
cases we directly pass through the fds unmodified.
This has the benefit that nspawn can be added into shell pipelines.
https://bugs.freedesktop.org/show_bug.cgi?id=87732
nspawn containers currently block module loading in all cases, with
no option to disable it. This allows an admin, specifically setting
capability=CAP_SYS_MODULE or capability=all to load modules.
After all it is now much more like strjoin() than strappend(). At the
same time, add support for NULL sentinels, even if they are normally not
necessary.
We really want /tmp to be properly mounted, especially in containers
that lack CAP_SYS_ADMIN or that are not fully booted up and only get a
shell, hence let's do so in nspawn already.
When we set up a loopback device with partition probing, the udev
"change" event about the configured device is first passed on to
userspace, only the the in-kernel partition prober is started. Since
partition probing fails with EBUSY when somebody has the device open,
the probing frequently fails since udev starts probing/opening the
device as soon as it gets the notification about it, and it might do so
earlier than the kernel probing.
This patch adds a (hopefully temporary) work-around for this, that
compares the number of probed partitions of the kernel with those of
blkid and synchronously asks for reprobing until the numebrs are in
sync.
This really deserves a proper kernel fix.
After all, nspawn can now dissect MBR partition levels, too, hence
".gpt" appears a misnomer. Moreover, the the .raw suffix for these files
is already pretty popular (the Fedora disk images use it for example),
hence sounds like an OK scheme to adopt.