The old code was only able to pass the value 0 for the inheritable
and ambient capability set when a non-root user was specified.
However, sometimes it is useful to run a program in its own container
with a user specification and some capabilities set. This is needed
when the capabilities cannot be provided by file capabilities (because
the file system is mounted with MS_NOSUID for additional security).
This commit introduces the option --ambient-capability and the config
file option AmbientCapability=. Both are used in a similar way to the
existing Capability= setting. It changes the inheritable and ambient
set (which is 0 by default). The code also checks that the settings
for the bounding set (as defined by Capability= and DropCapability=)
and the setting for the ambient set (as defined by AmbientCapability=)
are compatible. Otherwise, the operation would fail in any way.
Due to the current use of -1 to indicate no support for ambient
capability set the special value "all" cannot be supported.
Also, the setting of ambient capability is restricted to running a
single program in the container payload.
https://tools.ietf.org/html/draft-knodel-terminology-02https://lwn.net/Articles/823224/
This gets rid of most but not occasions of these loaded terms:
1. scsi_id and friends are something that is supposed to be removed from
our tree (see #7594)
2. The test suite defines an API used by the ubuntu CI. We can remove
this too later, but this needs to be done in sync with the ubuntu CI.
3. In some cases the terms are part of APIs we call or where we expose
concepts the kernel names the way it names them. (In particular all
remaining uses of the word "slave" in our codebase are like this,
it's used by the POSIX PTY layer, by the network subsystem, the mount
API and the block device subsystem). Getting rid of the term in these
contexts would mean doing some major fixes of the kernel ABI first.
Regarding the replacements: when whitelist/blacklist is used as noun we
replace with with allow list/deny list, and when used as verb with
allow-list/deny-list.
Let's add flavours for copying stub/uplink resolv.conf versions.
Let's add a more brutal "replace" mode, where we'll replace any existing
destination file.
Let's also change what "auto" means: instead of copying the static file,
let's use the stub file, so that DNS search info is copied over.
Fixes: #15340
The CPU_SET_S api is pretty bad. In particular, it has a parameter for the size
of the array, but operations which take two (CPU_EQUAL_S) or even three arrays
(CPU_{AND,OR,XOR}_S) still take just one size. This means that all arrays must
be of the same size, or buffer overruns will occur. This is exactly what our
code would do, if it received an array of unexpected size over the network.
("Unexpected" here means anything different from what cpu_set_malloc() detects
as the "right" size.)
Let's rework this, and store the size in bytes of the allocated storage area.
The code will now parse any number up to 8191, independently of what the current
kernel supports. This matches the kernel maximum setting for any architecture,
to make things more portable.
Fixes#12605.
This is a pretty large patch, and adds support for OCI runtime bundles
to nspawn. A new switch --oci-bundle= is added that takes a path to an
OCI bundle. The JSON file included therein is read similar to a .nspawn
settings files, however with a different feature set.
Implementation-wise this mostly extends the pre-existing Settings object
to carry additional properties for OCI. However, OCI supports some
concepts .nspawn files did not support yet, which this patch also adds:
1. Support for "masking" files and directories. This functionatly is now
also available via the new --inaccesible= cmdline command, and
Inaccessible= in .nspawn files.
2. Support for mounting arbitrary file systems. (not exposed through
nspawn cmdline nor .nspawn files, because probably not a good idea)
3. Ability to configure the console settings for a container. This
functionality is now also available on the nspawn cmdline in the new
--console= switch (not added to .nspawn for now, as it is something
specific to the invocation really, not a property of the container)
4. Console width/height configuration. Not exposed through
.nspawn/cmdline, but this may be controlled through $COLUMNS and
$LINES like in most other UNIX tools.
5. UID/GID configuration by raw numbers. (not exposed in .nspawn and on
the cmdline, since containers likely have different user tables, and
the existing --user= switch appears to be the better option)
6. OCI hook commands (no exposed in .nspawn/cmdline, as very specific to
OCI)
7. Creation of additional devices nodes in /dev. Most likely not a good
idea, hence not exposed in .nspawn/cmdline. There's already --bind=
to achieve the same, which is the better alternative.
8. Explicit syscall filters. This is not a good idea, due to the skewed
arch support, hence not exposed through .nspawn/cmdline.
9. Configuration of some sysctls on a whitelist. Questionnable, not
supported in .nspawn/cmdline for now.
10. Configuration of all 5 types of capabilities. Not a useful concept,
since the kernel will reduce the caps on execve() anyway. Not
exposed through .nspawn/cmdline as this is not very useful hence.
Note that this only implements the OCI runtime logic itself. It does not
provide a runc-compatible command line tool. This is left for a later
PR. Only with that in place tools such as "buildah" can use the OCI
support in nspawn as drop-in replacement.
Currently still missing is OCI hook support, but it's already parsed and
everything, and should be easy to add. Other than that it's OCI is
implemented pretty comprehensively.
There's a list of incompatibilities in the nspawn-oci.c file. In a later
PR I'd like to convert this into proper markdown and add it to the
documentation directory.
These lines are generally out-of-date, incomplete and unnecessary. With
SPDX and git repository much more accurate and fine grained information
about licensing and authorship is available, hence let's drop the
per-file copyright notice. Of course, removing copyright lines of others
is problematic, hence this commit only removes my own lines and leaves
all others untouched. It might be nicer if sooner or later those could
go away too, making git the only and accurate source of authorship
information.
This part of the copyright blurb stems from the GPL use recommendations:
https://www.gnu.org/licenses/gpl-howto.en.html
The concept appears to originate in times where version control was per
file, instead of per tree, and was a way to glue the files together.
Ultimately, we nowadays don't live in that world anymore, and this
information is entirely useless anyway, as people are very welcome to
copy these files into any projects they like, and they shouldn't have to
change bits that are part of our copyright header for that.
hence, let's just get rid of this old cruft, and shorten our codebase a
bit.
This makes most header files easier to look at. Also Emacs gets really
slow when browsing through large sections of overly long prototypes,
which is much improved by this macro.
We should probably not do something similar with too many other cases,
as macros like this might help readability for some, but make it worse
for others. But I think given the complexity of this specific prototype
and how often we use it, it's worth doing.
The use of UINT64_C() in the SettingsMask enum definition is misleading:
it does not mean that individual fields have this width. E.g., with
enum {
FOO = UINT64_C(1)
}
sizeof(FOO) gives 4. It only means that the shift is done properly. So
1 << 35 is undefined, but UINT64_C(1) << 35 is the expected 64 bit
constant. Thus, the use UINT64_C() is useful, because we know that the shifts
are done properly, no matter what the value of _RLIMIT_MAX is, but when those
fields are used in expressions, we don't know what size they will be
(probably 4). Let's add a define which "hides" the enum definition behind a
define which gives the same value but is actually 64 bit. I think this is a
nicer solution than requiring all users to cast SETTING_RLIMIT_FIRST before
use.
Fixes#9035.
Similar as the other options added before, this is primarily useful to
provide comprehensive OCI runtime compatbility, but might be useful
otherwise, too.
This simply controls the PR_SET_NO_NEW_PRIVS flag for the container.
This too is primarily relevant to provide OCI runtime compaitiblity, but
might have other uses too, in particular as it nicely complements the
existing --capability= and --drop-capability= flags.
Previously, the container's hostname was exclusively initialized from
the machine name configured with --machine=, i.e. the internal name and
the external name used for and by the container was synchronized. This
adds a new option --hostname= that optionally allows the internal name
to deviate from the external name.
This new option is mainly useful to ultimately implement the OCI runtime
spec directly in nspawn, but it might be useful on its own for some
other usecases too.
This ensures we set the various resource limits of our container
explicitly on each invocation so that we inherit less from our callers
into the payload.
By default resource limits are now set to the same values Linux
generally passes to the host PID 1, thus minimizing needless differences
between host and container environments.
The limits are now also configurable using a new --rlimit= switch. This
is preparation for teaching nspawn native OCI runtime support as OCI
permits setting resource limits for container payloads, and it hence
probably makes sense if we do too.
Files which are installed as-is (any .service and other unit files, .conf
files, .policy files, etc), are left as is. My assumption is that SPDX
identifiers are not yet that well known, so it's better to retain the
extended header to avoid any doubt.
I also kept any copyright lines. We can probably remove them, but it'd nice to
obtain explicit acks from all involved authors before doing that.
log.h really should only include the bare minimum of other headers, as
it is really pulled into pretty much everything else and already in
itself one of the most basic pieces of code we have.
Let's hence drop inclusion of:
1. sd-id128.h because it's entirely unneeded in current log.h
2. errno.h, dito.
3. sys/signalfd.h which we can replace by a simple struct forward
declaration
4. process-util.h which was needed for getpid_cached() which we now hide
in a funciton log_emergency_level() instead, which nicely abstracts
the details away.
5. sys/socket.h which was needed for struct iovec, but a simple struct
forward declaration suffices for that too.
Ultimately this actually makes our source tree larger (since users of
the functionality above must now include it themselves, log.h won't do
that for them), but I think it helps to untangle our web of includes a
tiny bit.
(Background: I'd like to isolate the generic bits of src/basic/ enough
so that we can do a git submodule import into casync for it)
Now that we have ported nspawn's seccomp code to the generic code in
seccomp-util, let's extend it to support whitelisting and blacklisting
of specific additional syscalls.
This uses similar syntax as PID1's support for system call filtering,
but in contrast to that always implements a blacklist (and not a
whitelist), as we prepopulate the filter with a blacklist, and the
unit's system call filter logic does not come with anything
prepopulated.
(Later on we might actually want to invert the logic here, and
whitelist rather than blacklist things, but at this point let's not do
that. In case we switch this over later, the syscall add/remove logic of
this commit should be compatible conceptually.)
Fixes: #5163
Replaces: #5944
Add a new --pivot-root argument to systemd-nspawn, which specifies a
directory to pivot to / inside the container; while the original / is
pivoted to another specified directory (if provided). This adds
support for booting container images which may contain several bootable
sysroots, as is common with OSTree disk images. When these disk images
are booted on real hardware, ostree-prepare-root is run in conjunction
with sysroot.mount in the initramfs to achieve the same results.
gperf-3.1 generates lookup functions that take a size_t length
parameter instead of unsigned int. Test for this at configure time.
Fixes: https://github.com/systemd/systemd/issues/5039
This the patch implements a notificaiton mechanism from the init process
in the container to systemd-nspawn.
The switch --notify-ready=yes configures systemd-nspawn to wait the "READY=1"
message from the init process in the container to send its own to systemd.
--notify-ready=no is equivalent to the previous behavior before this patch,
systemd-nspawn notifies systemd with a "READY=1" message when the container is
created. This notificaiton mechanism uses socket file with path relative to the contanier
"/run/systemd/nspawn/notify". The default values it --notify-ready=no.
It is also possible to configure this mechanism from the .nspawn files using
NotifyReady. This parameter takes the same options of the command line switch.
Before this patch, systemd-nspawn notifies "ready" after the inner child was created,
regardless the status of the service running inside it. Now, with --notify-ready=yes,
systemd-nspawn notifies when the service is ready. This is really useful when
there are dependencies between different contaniers.
Fixes https://github.com/systemd/systemd/issues/1369
Based on the work from https://github.com/systemd/systemd/pull/3022
Testing:
Boot a OS inside a container with systemd-nspawn.
Note: modify the commands accordingly with your filesystem.
1. Create a filesystem where you can boot an OS.
2. sudo systemd-nspawn -D ${HOME}/distros/fedora-23/ sh
2.1. Create the unit file /etc/systemd/system/sleep.service inside the container
(You can use the example below)
2.2. systemdctl enable sleep
2.3 exit
3. sudo systemd-run --service-type=notify --unit=notify-test
${HOME}/systemd/systemd-nspawn --notify-ready=yes
-D ${HOME}/distros/fedora-23/ -b
4. In a different shell run "systemctl status notify-test"
When using --notify-ready=yes the service status is "activating" for 20 seconds
before being set to "active (running)". Instead, using --notify-ready=no
the service status is marked "active (running)" quickly, without waiting for
the 20 seconds.
This patch was also test with --private-users=yes, you can test it just adding it
at the end of the command at point 3.
------ sleep.service ------
[Unit]
Description=sleep
After=network.target
[Service]
Type=oneshot
ExecStart=/bin/sleep 20
[Install]
WantedBy=multi-user.target
------------ end ------------
This adds a new concept of network "zones", which are little more than bridge
devices that are automatically managed by nspawn: when the first container
referencing a bridge is started, the bridge device is created, when the last
container referencing it is removed the bridge device is removed again. Besides
this logic --network-zone= is pretty much identical to --network-bridge=.
The usecase for this is to make it easy to run multiple related containers
(think MySQL in one and Apache in another) in a common, named virtual Ethernet
broadcast zone, that only exists as long as one of them is running, and fully
automatically managed otherwise.
In order to implement this we change the bool arg_userns into an enum
UserNamespaceMode, which can take one of NO, PICK or FIXED, and replace the
arg_uid_range_pick bool with it.
This adds a new switch --as-pid2, which allows running commands as PID 2, while a stub init process is run as PID 1.
This is useful in order to run arbitrary commands in a container, as PID1's semantics are different from all other
processes regarding reaping of unknown children or signal handling.
The new switch operates like --network-veth, but may be specified
multiple times (to define multiple link pairs) and allows flexible
definition of the interface names.
This is an independent reimplementation of #1678, but defines different
semantics, keeping the behaviour completely independent of
--network-veth. It also comes will full hook-up for .nspawn files, and
the matching documentation.
Make sure we acquire CAP_NET_ADMIN if we require virtual networking.
Make sure we imply virtual ethernet correctly when bridge is request.
Fixes: #1511Fixes: #1554Fixes: #1590
.nspawn fiels are simple settings files that may accompany container
images and directories and contain settings otherwise passed on the
nspawn command line. This provides an efficient way to attach execution
data directly to containers.