Commit graph

48 commits

Author SHA1 Message Date
Torsten Hilbrich 88fc9c9bad systemd-nspawn: Allow setting ambient capability set
The old code was only able to pass the value 0 for the inheritable
and ambient capability set when a non-root user was specified.

However, sometimes it is useful to run a program in its own container
with a user specification and some capabilities set. This is needed
when the capabilities cannot be provided by file capabilities (because
the file system is mounted with MS_NOSUID for additional security).

This commit introduces the option --ambient-capability and the config
file option AmbientCapability=. Both are used in a similar way to the
existing Capability= setting. It changes the inheritable and ambient
set (which is 0 by default). The code also checks that the settings
for the bounding set (as defined by Capability= and DropCapability=)
and the setting for the ambient set (as defined by AmbientCapability=)
are compatible. Otherwise, the operation would fail in any way.

Due to the current use of -1 to indicate no support for ambient
capability set the special value "all" cannot be supported.

Also, the setting of ambient capability is restricted to running a
single program in the container payload.
2020-12-07 19:56:59 +01:00
Yu Watanabe db9ecf0501 license: LGPL-2.1+ -> LGPL-2.1-or-later 2020-11-09 13:23:58 +09:00
Lennart Poettering 3652872add nspawn: add --set-credential= and --load-credential=
Let's allow passing in creds to containers, so that PID 1 inside the
container can pick them up.
2020-08-25 19:45:47 +02:00
Lennart Poettering 6b000af4f2 tree-wide: avoid some loaded terms
https://tools.ietf.org/html/draft-knodel-terminology-02
https://lwn.net/Articles/823224/

This gets rid of most but not occasions of these loaded terms:

1. scsi_id and friends are something that is supposed to be removed from
   our tree (see #7594)

2. The test suite defines an API used by the ubuntu CI. We can remove
   this too later, but this needs to be done in sync with the ubuntu CI.

3. In some cases the terms are part of APIs we call or where we expose
   concepts the kernel names the way it names them. (In particular all
   remaining uses of the word "slave" in our codebase are like this,
   it's used by the POSIX PTY layer, by the network subsystem, the mount
   API and the block device subsystem). Getting rid of the term in these
   contexts would mean doing some major fixes of the kernel ABI first.

Regarding the replacements: when whitelist/blacklist is used as noun we
replace with with allow list/deny list, and when used as verb with
allow-list/deny-list.
2020-06-25 09:00:19 +02:00
Tobias Hunger 129635333d repart: Add UUID option to config files
Add a option to provide a UUID for the partition that will get
created and document that.
2020-05-25 15:48:59 +02:00
Lennart Poettering 86775e3524 nspawn: beef up --resolve-conf= modes
Let's add flavours for copying stub/uplink resolv.conf versions.

Let's add a more brutal "replace" mode, where we'll replace any existing
destination file.

Let's also change what "auto" means: instead of copying the static file,
let's use the stub file, so that DNS search info is copied over.

Fixes: #15340
2020-04-22 19:38:04 +02:00
Zbigniew Jędrzejewski-Szmek 0985c7c4e2 Rework cpu affinity parsing
The CPU_SET_S api is pretty bad. In particular, it has a parameter for the size
of the array, but operations which take two (CPU_EQUAL_S) or even three arrays
(CPU_{AND,OR,XOR}_S) still take just one size. This means that all arrays must
be of the same size, or buffer overruns will occur. This is exactly what our
code would do, if it received an array of unexpected size over the network.
("Unexpected" here means anything different from what cpu_set_malloc() detects
as the "right" size.)

Let's rework this, and store the size in bytes of the allocated storage area.

The code will now parse any number up to 8191, independently of what the current
kernel supports. This matches the kernel maximum setting for any architecture,
to make things more portable.

Fixes #12605.
2019-05-29 10:20:42 +02:00
Zbigniew Jędrzejewski-Szmek ca78ad1de9 headers: remove unneeded includes from util.h
This means we need to include many more headers in various files that simply
included util.h before, but it seems cleaner to do it this way.
2019-03-27 11:53:12 +01:00
Zbigniew Jędrzejewski-Szmek b2645747b7 nspawn-oci: fix double free
Also rename function to make it clear that it also frees the array
object itself.
2019-03-22 17:39:12 +01:00
Lennart Poettering de40a3037a nspawn: add support for executing OCI runtime bundles with nspawn
This is a pretty large patch, and adds support for OCI runtime bundles
to nspawn. A new switch --oci-bundle= is added that takes a path to an
OCI bundle. The JSON file included therein is read similar to a .nspawn
settings files, however with a different feature set.

Implementation-wise this mostly extends the pre-existing Settings object
to carry additional properties for OCI. However, OCI supports some
concepts .nspawn files did not support yet, which this patch also adds:

1. Support for "masking" files and directories. This functionatly is now
   also available via the new --inaccesible= cmdline command, and
   Inaccessible= in .nspawn files.

2. Support for mounting arbitrary file systems. (not exposed through
   nspawn cmdline nor .nspawn files, because probably not a good idea)

3. Ability to configure the console settings for a container. This
   functionality is now also available on the nspawn cmdline in the new
   --console= switch (not added to .nspawn for now, as it is something
   specific to the invocation really, not a property of the container)

4. Console width/height configuration. Not exposed through
   .nspawn/cmdline, but this may be controlled through $COLUMNS and
   $LINES like in most other UNIX tools.

5. UID/GID configuration by raw numbers. (not exposed in .nspawn and on
   the cmdline, since containers likely have different user tables, and
   the existing --user= switch appears to be the better option)

6. OCI hook commands (no exposed in .nspawn/cmdline, as very specific to
   OCI)

7. Creation of additional devices nodes in /dev. Most likely not a good
   idea, hence not exposed in .nspawn/cmdline. There's already --bind=
   to achieve the same, which is the better alternative.

8. Explicit syscall filters. This is not a good idea, due to the skewed
   arch support, hence not exposed through .nspawn/cmdline.

9. Configuration of some sysctls on a whitelist. Questionnable, not
   supported in .nspawn/cmdline for now.

10. Configuration of all 5 types of capabilities. Not a useful concept,
    since the kernel will reduce the caps on execve() anyway. Not
    exposed through .nspawn/cmdline as this is not very useful hence.

Note that this only implements the OCI runtime logic itself. It does not
provide a runc-compatible command line tool. This is left for a later
PR. Only with that in place tools such as "buildah" can use the OCI
support in nspawn as drop-in replacement.

Currently still missing is OCI hook support, but it's already parsed and
everything, and should be easy to add. Other than that it's OCI is
implemented pretty comprehensively.

There's a list of incompatibilities in the nspawn-oci.c file. In a later
PR I'd like to convert this into proper markdown and add it to the
documentation directory.
2019-03-15 15:41:28 +01:00
Yu Watanabe e93672eeac tree-wide: drop missing.h from headers and use relevant missing_*.h 2018-12-06 13:31:16 +01:00
Yu Watanabe 36dd5ffd5d util: drop missing.h from util.h 2018-12-04 10:00:34 +01:00
Jiuyang liu a2f577fca0 add ephemeral to nspawn-settings. 2018-10-24 10:22:20 +02:00
Lennart Poettering 0c69794138 tree-wide: remove Lennart's copyright lines
These lines are generally out-of-date, incomplete and unnecessary. With
SPDX and git repository much more accurate and fine grained information
about licensing and authorship is available, hence let's drop the
per-file copyright notice. Of course, removing copyright lines of others
is problematic, hence this commit only removes my own lines and leaves
all others untouched. It might be nicer if sooner or later those could
go away too, making git the only and accurate source of authorship
information.
2018-06-14 10:20:20 +02:00
Lennart Poettering 818bf54632 tree-wide: drop 'This file is part of systemd' blurb
This part of the copyright blurb stems from the GPL use recommendations:

https://www.gnu.org/licenses/gpl-howto.en.html

The concept appears to originate in times where version control was per
file, instead of per tree, and was a way to glue the files together.
Ultimately, we nowadays don't live in that world anymore, and this
information is entirely useless anyway, as people are very welcome to
copy these files into any projects they like, and they shouldn't have to
change bits that are part of our copyright header for that.

hence, let's just get rid of this old cruft, and shorten our codebase a
bit.
2018-06-14 10:20:20 +02:00
Lennart Poettering f728ab1724 nspawn: let's rename _FORCE_ENUM_WIDTH → _SETTING_FORCE_ENUM_WIDTH
Just some preparation in case we need a similar hack in another enum one
day.
2018-05-22 16:21:26 +02:00
Lennart Poettering 1688841f46 nspawn: similar to the previous patches, also make /etc/localtime handling more configurable
Fixes: #9009
2018-05-22 16:21:26 +02:00
Lennart Poettering 4e1d6aa983 nspawn: make --link-journal= configurable through .nspawn files, too 2018-05-22 16:20:08 +02:00
Lennart Poettering 09d423e921 nspawn: add greater control over how /etc/resolv.conf is handled
Fixes: #8014 #1781
2018-05-22 16:19:26 +02:00
Lennart Poettering 8904ab86b0
Merge pull request #9062 from poettering/parse-conf-macro
add new CONFIG_PARSER_PROTOTYPE() macro
2018-05-22 16:14:49 +02:00
Lennart Poettering a210692525 tree-wide: port over all code to the new CONFIG_PARSER_PROTOTYPE() macro
This makes most header files easier to look at. Also Emacs gets really
slow when browsing through large sections of overly long prototypes,
which is much improved by this macro.

We should probably not do something similar with too many other cases,
as macros like this might help readability for some, but make it worse
for others. But I think given the complexity of this specific prototype
and how often we use it, it's worth doing.
2018-05-22 13:18:44 +02:00
Zbigniew Jędrzejewski-Szmek b49c6ca089 systemd-nspawn: make SettingsMask 64 bit wide
The use of UINT64_C() in the SettingsMask enum definition is misleading:
it does not mean that individual fields have this width. E.g., with
enum {
   FOO = UINT64_C(1)
}
sizeof(FOO) gives 4. It only means that the shift is done properly. So
1 << 35 is undefined, but UINT64_C(1) << 35 is the expected 64 bit
constant. Thus, the use UINT64_C() is useful, because we know that the shifts
are done properly, no matter what the value of _RLIMIT_MAX is, but when those
fields are used in expressions, we don't know what size they will be
(probably 4). Let's add a define which "hides" the enum definition behind a
define which gives the same value but is actually 64 bit. I think this is a
nicer solution than requiring all users to cast SETTING_RLIMIT_FIRST before
use.

Fixes #9035.
2018-05-22 10:51:49 +02:00
Lennart Poettering d107bb7d63 nspawn: add a new --cpu-affinity= switch
Similar as the other options added before, this is primarily useful to
provide comprehensive OCI runtime compatbility, but might be useful
otherwise, too.
2018-05-17 20:48:54 +02:00
Lennart Poettering 81f345dfed nspawn: add a new --oom-score-adjust= command line switch
This is primarily useful in order to provide comprehensive OCI runtime
compatibility with nspawn, but might have uses outside of it.
2018-05-17 20:48:12 +02:00
Lennart Poettering 66edd96310 nspawn: add a new --no-new-privileges= cmdline option to nspawn
This simply controls the PR_SET_NO_NEW_PRIVS flag for the container.
This too is primarily relevant to provide OCI runtime compaitiblity, but
might have other uses too, in particular as it nicely complements the
existing --capability= and --drop-capability= flags.
2018-05-17 20:47:20 +02:00
Lennart Poettering 3a9530e5f1 nspawn: make the hostname of the container explicitly configurable with a new --hostname= switch
Previously, the container's hostname was exclusively initialized from
the machine name configured with --machine=, i.e. the internal name and
the external name used for and by the container was synchronized. This
adds a new option --hostname= that optionally allows the internal name
to deviate from the external name.

This new option is mainly useful to ultimately implement the OCI runtime
spec directly in nspawn, but it might be useful on its own for some
other usecases too.
2018-05-17 20:46:45 +02:00
Lennart Poettering bf428efb07 nspawn: add new --rlimit= switch, and always set resource limits explicitly for our container payloads
This ensures we set the various resource limits of our container
explicitly on each invocation so that we inherit less from our callers
into the payload.

By default resource limits are now set to the same values Linux
generally passes to the host PID 1, thus minimizing needless differences
between host and container environments.

The limits are now also configurable using a new --rlimit= switch. This
is preparation for teaching nspawn native OCI runtime support as OCI
permits setting resource limits for container payloads, and it hence
probably makes sense if we do too.
2018-05-17 20:45:54 +02:00
Lennart Poettering 88614c8a28 nspawn: size_t more stuff
A follow-up for #8840
2018-05-03 17:19:46 +02:00
Zbigniew Jędrzejewski-Szmek 11a1589223 tree-wide: drop license boilerplate
Files which are installed as-is (any .service and other unit files, .conf
files, .policy files, etc), are left as is. My assumption is that SPDX
identifiers are not yet that well known, so it's better to retain the
extended header to avoid any doubt.

I also kept any copyright lines. We can probably remove them, but it'd nice to
obtain explicit acks from all involved authors before doing that.
2018-04-06 18:58:55 +02:00
Lennart Poettering dccca82b1a log: minimize includes in log.h
log.h really should only include the bare minimum of other headers, as
it is really pulled into pretty much everything else and already in
itself one of the most basic pieces of code we have.

Let's hence drop inclusion of:

1. sd-id128.h because it's entirely unneeded in current log.h
2. errno.h, dito.
3. sys/signalfd.h which we can replace by a simple struct forward
   declaration
4. process-util.h which was needed for getpid_cached() which we now hide
   in a funciton log_emergency_level() instead, which nicely abstracts
   the details away.
5. sys/socket.h which was needed for struct iovec, but a simple struct
   forward declaration suffices for that too.

Ultimately this actually makes our source tree larger (since users of
the functionality above must now include it themselves, log.h won't do
that for them), but I think it helps to untangle our web of includes a
tiny bit.

(Background: I'd like to isolate the generic bits of src/basic/ enough
so that we can do a git submodule import into casync for it)
2018-01-11 14:44:31 +01:00
Zbigniew Jędrzejewski-Szmek 53e1b68390 Add SPDX license identifiers to source files under the LGPL
This follows what the kernel is doing, c.f.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5fd54ace4721fc5ce2bb5aef6318fcf17f421460.
2017-11-19 19:08:15 +01:00
Lennart Poettering 960e4569e1 nspawn: implement configurable syscall whitelisting/blacklisting
Now that we have ported nspawn's seccomp code to the generic code in
seccomp-util, let's extend it to support whitelisting and blacklisting
of specific additional syscalls.

This uses similar syntax as PID1's support for system call filtering,
but in contrast to that always implements a blacklist (and not a
whitelist), as we prepopulate the filter with a blacklist, and the
unit's system call filter logic does not come with anything
prepopulated.

(Later on we might actually want to invert the logic here, and
whitelist rather than blacklist things, but at this point let's not do
that. In case we switch this over later, the syscall add/remove logic of
this commit should be compatible conceptually.)

Fixes: #5163

Replaces: #5944
2017-09-12 14:06:21 +02:00
Philip Withnall b53ede699c nspawn: Add support for sysroot pivoting (#5258)
Add a new --pivot-root argument to systemd-nspawn, which specifies a
directory to pivot to / inside the container; while the original / is
pivoted to another specified directory (if provided). This adds
support for booting container images which may contain several bootable
sysroots, as is common with OSTree disk images. When these disk images
are booted on real hardware, ostree-prepare-root is run in conjunction
with sysroot.mount in the initramfs to achieve the same results.
2017-02-08 16:54:31 +01:00
Mike Gilbert c9f7b4d356 build-sys: add check for gperf lookup function signature (#5055)
gperf-3.1 generates lookup functions that take a size_t length
parameter instead of unsigned int. Test for this at configure time.

Fixes: https://github.com/systemd/systemd/issues/5039
2017-01-10 08:39:05 +01:00
Lennart Poettering 7b4318b6a5 nspawn: add ability to configure overlay mounts to .nspawn files
Fixes: #4634
2016-12-01 12:41:17 +01:00
Alessandro Puccetti 9c1e04d0fa nspawn: introduce --notify-ready=[no|yes] (#3474)
This the patch implements a notificaiton mechanism from the init process
in the container to systemd-nspawn.
The switch --notify-ready=yes configures systemd-nspawn to wait the "READY=1"
message from the init process in the container to send its own to systemd.
--notify-ready=no is equivalent to the previous behavior before this patch,
systemd-nspawn notifies systemd with a "READY=1" message when the container is
created. This notificaiton mechanism uses socket file with path relative to the contanier
"/run/systemd/nspawn/notify". The default values it --notify-ready=no.
It is also possible to configure this mechanism from the .nspawn files using
NotifyReady. This parameter takes the same options of the command line switch.

Before this patch, systemd-nspawn notifies "ready" after the inner child was created,
regardless the status of the service running inside it. Now, with --notify-ready=yes,
systemd-nspawn notifies when the service is ready. This is really useful when
there are dependencies between different contaniers.

Fixes https://github.com/systemd/systemd/issues/1369
Based on the work from https://github.com/systemd/systemd/pull/3022

Testing:
Boot a OS inside a container with systemd-nspawn.
Note: modify the commands accordingly with your filesystem.

1. Create a filesystem where you can boot an OS.
2. sudo systemd-nspawn -D ${HOME}/distros/fedora-23/ sh
2.1. Create the unit file /etc/systemd/system/sleep.service inside the container
     (You can use the example below)
2.2. systemdctl enable sleep
2.3 exit
3. sudo systemd-run --service-type=notify --unit=notify-test
   ${HOME}/systemd/systemd-nspawn --notify-ready=yes
   -D ${HOME}/distros/fedora-23/ -b
4. In a different shell run "systemctl status notify-test"

When using --notify-ready=yes the service status is "activating" for 20 seconds
before being set to "active (running)". Instead, using --notify-ready=no
the service status is marked "active (running)" quickly, without waiting for
the 20 seconds.

This patch was also test with --private-users=yes, you can test it just adding it
at the end of the command at point 3.

------ sleep.service ------
[Unit]
Description=sleep
After=network.target

[Service]
Type=oneshot
ExecStart=/bin/sleep 20

[Install]
WantedBy=multi-user.target
------------ end ------------
2016-06-10 13:09:06 +02:00
Lennart Poettering 22b28dfdc7 nspawn: add new --network-zone= switch for automatically managed bridge devices
This adds a new concept of network "zones", which are little more than bridge
devices that are automatically managed by nspawn: when the first container
referencing a bridge is started, the bridge device is created, when the last
container referencing it is removed the bridge device is removed again. Besides
this logic --network-zone= is pretty much identical to --network-bridge=.

The usecase for this is to make it easy to run multiple related containers
(think MySQL in one and Apache in another) in a common, named virtual Ethernet
broadcast zone, that only exists as long as one of them is running, and fully
automatically managed otherwise.
2016-05-09 15:45:31 +02:00
Lennart Poettering 0de7accea9 nspawn: allow configuration of user namespaces in .nspawn files
In order to implement this we change the bool arg_userns into an enum
UserNamespaceMode, which can take one of NO, PICK or FIXED, and replace the
arg_uid_range_pick bool with it.
2016-04-25 12:16:02 +02:00
Daniel Mack b26fa1a2fb tree-wide: remove Emacs lines from all files
This should be handled fine now by .dir-locals.el, so need to carry that
stuff in every file.
2016-02-10 13:41:57 +01:00
Lennart Poettering 7732f92bad nspawn: optionally run a stub init process as PID 1
This adds a new switch --as-pid2, which allows running commands as PID 2, while a stub init process is run as PID 1.
This is useful in order to run arbitrary commands in a container, as PID1's semantics are different from all other
processes regarding reaping of unknown children or signal handling.
2016-02-03 23:58:24 +01:00
Lennart Poettering 5f932eb9af nspawn: add new --chdir= switch
Fixes: #2192
2016-02-03 23:58:24 +01:00
Thomas Hindoe Paaboel Andersen 71d35b6b55 tree-wide: sort includes in *.h
This is a continuation of the previous include sort patch, which
only sorted for .c files.
2015-11-18 23:09:02 +01:00
Lennart Poettering f6d6bad146 nspawn: add new --network-veth-extra= switch for defining additional veth links
The new switch operates like --network-veth, but may be specified
multiple times (to define multiple link pairs) and allows flexible
definition of the interface names.

This is an independent reimplementation of #1678, but defines different
semantics, keeping the behaviour completely independent of
--network-veth. It also comes will full hook-up for .nspawn files, and
the matching documentation.
2015-11-12 22:04:49 +01:00
Lennart Poettering 0e2656744f nspawn: rework how we determine private networking settings
Make sure we acquire CAP_NET_ADMIN if we require virtual networking.

Make sure we imply virtual ethernet correctly when bridge is request.

Fixes: #1511
Fixes: #1554
Fixes: #1590
2015-10-22 01:59:25 +02:00
Lennart Poettering 2b5c04d59c nspawn: remove nspawn.h, it's empty now 2015-09-07 18:47:34 +02:00
Lennart Poettering 7a8f63251d nspawn: split all port exposure code into nspawn-expose-port.[ch] 2015-09-07 18:44:30 +02:00
Lennart Poettering e83bebeff7 nspawn: split out mount related functions into a new nspawn-mount.c file 2015-09-07 18:44:30 +02:00
Lennart Poettering f757855e81 nspawn: add new .nspawn files for container settings
.nspawn fiels are simple settings files that may accompany container
images and directories and contain settings otherwise passed on the
nspawn command line. This provides an efficient way to attach execution
data directly to containers.
2015-09-06 01:49:06 +02:00