This makes most header files easier to look at. Also Emacs gets really
slow when browsing through large sections of overly long prototypes,
which is much improved by this macro.
We should probably not do something similar with too many other cases,
as macros like this might help readability for some, but make it worse
for others. But I think given the complexity of this specific prototype
and how often we use it, it's worth doing.
The use of UINT64_C() in the SettingsMask enum definition is misleading:
it does not mean that individual fields have this width. E.g., with
enum {
FOO = UINT64_C(1)
}
sizeof(FOO) gives 4. It only means that the shift is done properly. So
1 << 35 is undefined, but UINT64_C(1) << 35 is the expected 64 bit
constant. Thus, the use UINT64_C() is useful, because we know that the shifts
are done properly, no matter what the value of _RLIMIT_MAX is, but when those
fields are used in expressions, we don't know what size they will be
(probably 4). Let's add a define which "hides" the enum definition behind a
define which gives the same value but is actually 64 bit. I think this is a
nicer solution than requiring all users to cast SETTING_RLIMIT_FIRST before
use.
Fixes#9035.
Similar as the other options added before, this is primarily useful to
provide comprehensive OCI runtime compatbility, but might be useful
otherwise, too.
This simply controls the PR_SET_NO_NEW_PRIVS flag for the container.
This too is primarily relevant to provide OCI runtime compaitiblity, but
might have other uses too, in particular as it nicely complements the
existing --capability= and --drop-capability= flags.
Previously, the container's hostname was exclusively initialized from
the machine name configured with --machine=, i.e. the internal name and
the external name used for and by the container was synchronized. This
adds a new option --hostname= that optionally allows the internal name
to deviate from the external name.
This new option is mainly useful to ultimately implement the OCI runtime
spec directly in nspawn, but it might be useful on its own for some
other usecases too.
This ensures we set the various resource limits of our container
explicitly on each invocation so that we inherit less from our callers
into the payload.
By default resource limits are now set to the same values Linux
generally passes to the host PID 1, thus minimizing needless differences
between host and container environments.
The limits are now also configurable using a new --rlimit= switch. This
is preparation for teaching nspawn native OCI runtime support as OCI
permits setting resource limits for container payloads, and it hence
probably makes sense if we do too.
Files which are installed as-is (any .service and other unit files, .conf
files, .policy files, etc), are left as is. My assumption is that SPDX
identifiers are not yet that well known, so it's better to retain the
extended header to avoid any doubt.
I also kept any copyright lines. We can probably remove them, but it'd nice to
obtain explicit acks from all involved authors before doing that.
log.h really should only include the bare minimum of other headers, as
it is really pulled into pretty much everything else and already in
itself one of the most basic pieces of code we have.
Let's hence drop inclusion of:
1. sd-id128.h because it's entirely unneeded in current log.h
2. errno.h, dito.
3. sys/signalfd.h which we can replace by a simple struct forward
declaration
4. process-util.h which was needed for getpid_cached() which we now hide
in a funciton log_emergency_level() instead, which nicely abstracts
the details away.
5. sys/socket.h which was needed for struct iovec, but a simple struct
forward declaration suffices for that too.
Ultimately this actually makes our source tree larger (since users of
the functionality above must now include it themselves, log.h won't do
that for them), but I think it helps to untangle our web of includes a
tiny bit.
(Background: I'd like to isolate the generic bits of src/basic/ enough
so that we can do a git submodule import into casync for it)
Now that we have ported nspawn's seccomp code to the generic code in
seccomp-util, let's extend it to support whitelisting and blacklisting
of specific additional syscalls.
This uses similar syntax as PID1's support for system call filtering,
but in contrast to that always implements a blacklist (and not a
whitelist), as we prepopulate the filter with a blacklist, and the
unit's system call filter logic does not come with anything
prepopulated.
(Later on we might actually want to invert the logic here, and
whitelist rather than blacklist things, but at this point let's not do
that. In case we switch this over later, the syscall add/remove logic of
this commit should be compatible conceptually.)
Fixes: #5163
Replaces: #5944
Add a new --pivot-root argument to systemd-nspawn, which specifies a
directory to pivot to / inside the container; while the original / is
pivoted to another specified directory (if provided). This adds
support for booting container images which may contain several bootable
sysroots, as is common with OSTree disk images. When these disk images
are booted on real hardware, ostree-prepare-root is run in conjunction
with sysroot.mount in the initramfs to achieve the same results.
gperf-3.1 generates lookup functions that take a size_t length
parameter instead of unsigned int. Test for this at configure time.
Fixes: https://github.com/systemd/systemd/issues/5039
This the patch implements a notificaiton mechanism from the init process
in the container to systemd-nspawn.
The switch --notify-ready=yes configures systemd-nspawn to wait the "READY=1"
message from the init process in the container to send its own to systemd.
--notify-ready=no is equivalent to the previous behavior before this patch,
systemd-nspawn notifies systemd with a "READY=1" message when the container is
created. This notificaiton mechanism uses socket file with path relative to the contanier
"/run/systemd/nspawn/notify". The default values it --notify-ready=no.
It is also possible to configure this mechanism from the .nspawn files using
NotifyReady. This parameter takes the same options of the command line switch.
Before this patch, systemd-nspawn notifies "ready" after the inner child was created,
regardless the status of the service running inside it. Now, with --notify-ready=yes,
systemd-nspawn notifies when the service is ready. This is really useful when
there are dependencies between different contaniers.
Fixes https://github.com/systemd/systemd/issues/1369
Based on the work from https://github.com/systemd/systemd/pull/3022
Testing:
Boot a OS inside a container with systemd-nspawn.
Note: modify the commands accordingly with your filesystem.
1. Create a filesystem where you can boot an OS.
2. sudo systemd-nspawn -D ${HOME}/distros/fedora-23/ sh
2.1. Create the unit file /etc/systemd/system/sleep.service inside the container
(You can use the example below)
2.2. systemdctl enable sleep
2.3 exit
3. sudo systemd-run --service-type=notify --unit=notify-test
${HOME}/systemd/systemd-nspawn --notify-ready=yes
-D ${HOME}/distros/fedora-23/ -b
4. In a different shell run "systemctl status notify-test"
When using --notify-ready=yes the service status is "activating" for 20 seconds
before being set to "active (running)". Instead, using --notify-ready=no
the service status is marked "active (running)" quickly, without waiting for
the 20 seconds.
This patch was also test with --private-users=yes, you can test it just adding it
at the end of the command at point 3.
------ sleep.service ------
[Unit]
Description=sleep
After=network.target
[Service]
Type=oneshot
ExecStart=/bin/sleep 20
[Install]
WantedBy=multi-user.target
------------ end ------------
This adds a new concept of network "zones", which are little more than bridge
devices that are automatically managed by nspawn: when the first container
referencing a bridge is started, the bridge device is created, when the last
container referencing it is removed the bridge device is removed again. Besides
this logic --network-zone= is pretty much identical to --network-bridge=.
The usecase for this is to make it easy to run multiple related containers
(think MySQL in one and Apache in another) in a common, named virtual Ethernet
broadcast zone, that only exists as long as one of them is running, and fully
automatically managed otherwise.
In order to implement this we change the bool arg_userns into an enum
UserNamespaceMode, which can take one of NO, PICK or FIXED, and replace the
arg_uid_range_pick bool with it.
This adds a new switch --as-pid2, which allows running commands as PID 2, while a stub init process is run as PID 1.
This is useful in order to run arbitrary commands in a container, as PID1's semantics are different from all other
processes regarding reaping of unknown children or signal handling.
The new switch operates like --network-veth, but may be specified
multiple times (to define multiple link pairs) and allows flexible
definition of the interface names.
This is an independent reimplementation of #1678, but defines different
semantics, keeping the behaviour completely independent of
--network-veth. It also comes will full hook-up for .nspawn files, and
the matching documentation.
Make sure we acquire CAP_NET_ADMIN if we require virtual networking.
Make sure we imply virtual ethernet correctly when bridge is request.
Fixes: #1511Fixes: #1554Fixes: #1590
.nspawn fiels are simple settings files that may accompany container
images and directories and contain settings otherwise passed on the
nspawn command line. This provides an efficient way to attach execution
data directly to containers.