Commit Graph

832 Commits

Author SHA1 Message Date
Florian Westphal 761cf19d7b firewall-util: introduce context structure
for planned nft backend we have three choices:

- open/close a new nfnetlink socket for every operation
- keep a nfnetlink socket open internally
- expose a opaque fw_ctx and stash all internal data here.

Originally I opted for the 2nd option, but during review it was
suggested to avoid static storage duration because of perceived
problems with threaded applications.

This adds fw_ctx and new/free functions, then converts the existing api
and nspawn and networkd to use it.
2020-12-16 00:35:56 +01:00
Lennart Poettering 52ef5dd798 hostname-util: flagsify hostname_is_valid(), drop machine_name_is_valid()
Let's clean up hostname_is_valid() a bit: let's turn the second boolean
argument into a more explanatory flags field, and add a flag that
accepts the special name ".host" as valid. This is useful for the
container logic, where the special hostname ".host" refers to the "root
container", i.e. the host system itself, and can be specified at various
places.

let's also get rid of machine_name_is_valid(). It was just an alias,
which is confusing and even more so now that we have the flags param.
2020-12-15 17:59:48 +01:00
Torsten Hilbrich 88fc9c9bad systemd-nspawn: Allow setting ambient capability set
The old code was only able to pass the value 0 for the inheritable
and ambient capability set when a non-root user was specified.

However, sometimes it is useful to run a program in its own container
with a user specification and some capabilities set. This is needed
when the capabilities cannot be provided by file capabilities (because
the file system is mounted with MS_NOSUID for additional security).

This commit introduces the option --ambient-capability and the config
file option AmbientCapability=. Both are used in a similar way to the
existing Capability= setting. It changes the inheritable and ambient
set (which is 0 by default). The code also checks that the settings
for the bounding set (as defined by Capability= and DropCapability=)
and the setting for the ambient set (as defined by AmbientCapability=)
are compatible. Otherwise, the operation would fail in any way.

Due to the current use of -1 to indicate no support for ambient
capability set the special value "all" cannot be supported.

Also, the setting of ambient capability is restricted to running a
single program in the container payload.
2020-12-07 19:56:59 +01:00
Lennart Poettering 986311c2da fileio: teach read_full_file_full() to read from offset/with maximum size 2020-12-01 14:17:47 +01:00
Yu Watanabe db9ecf0501 license: LGPL-2.1+ -> LGPL-2.1-or-later 2020-11-09 13:23:58 +09:00
Lennart Poettering d3dcf4e3b9 fileio: beef up READ_FULL_FILE_CONNECT_SOCKET to allow setting sender socket name
This beefs up the READ_FULL_FILE_CONNECT_SOCKET logic of
read_full_file_full() a bit: when used a sender socket name may be
specified. If specified as NULL behaviour is as before: the client
socket name is picked by the kernel. But if specified as non-NULL the
client can pick a socket name to use when connecting. This is useful to
communicate a minimal amount of metainformation from client to server,
outside of the transport payload.

Specifically, these beefs up the service credential logic to pass an
abstract AF_UNIX socket name as client socket name when connecting via
READ_FULL_FILE_CONNECT_SOCKET, that includes the requesting unit name
and the eventual credential name. This allows servers implementing the
trivial credential socket logic to distinguish clients: via a simple
getpeername() it can be determined which unit is requesting a
credential, and which credential specifically.

Example: with this patch in place, in a unit file "waldo.service" a
configuration line like the following:

    LoadCredential=foo:/run/quux/creds.sock

will result in a connection to the AF_UNIX socket /run/quux/creds.sock,
originating from an abstract namespace AF_UNIX socket:

    @$RANDOM/unit/waldo.service/foo

(The $RANDOM is replaced by some randomized string. This is included in
the socket name order to avoid namespace squatting issues: the abstract
socket namespace is open to unprivileged users after all, and care needs
to be taken not to use guessable names)

The services listening on the /run/quux/creds.sock socket may thus
easily retrieve the name of the unit the credential is requested for
plus the credential name, via a simpler getpeername(), discarding the
random preifx and the /unit/ string.

This logic uses "/" as separator between the fields, since both unit
names and credential names appear in the file system, and thus are
designed to use "/" as outer separators. Given that it's a good safe
choice to use as separators here, too avoid any conflicts.

This is a minimal patch only: the new logic is used only for the unit
file credential logic. For other places where we use
READ_FULL_FILE_CONNECT_SOCKET it is probably a good idea to use this
scheme too, but this should be done carefully in later patches, since
the socket names become API that way, and we should determine the right
amount of info to pass over.
2020-11-03 09:48:04 +01:00
Harald Seiler c5fbeedb0c nspawn: robustly deal with "uninitialized" machine-id
When nspawn starts an image, this image could be in any state, including
an aborted first boot.  For this case, it needs to correctly handle the
situation like there was no machine-id at all.
2020-10-19 16:28:21 +02:00
Frantisek Sumsal d7a0f1f4f9 tree-wide: assorted coccinelle fixes 2020-10-09 15:02:23 +02:00
Lennart Poettering 3462d773d2 nspawn: don't chown() stdin/stdout passed in when --console=pipe is used
We should chown what we allocate ourselves, i.e. any pty we allocate
ourselves. But for stuff we propagate, let's avoid that: we shouldn't
make more changes than necessary.

Fixes: #17229
2020-10-02 12:05:08 +02:00
Zbigniew Jędrzejewski-Szmek 38ee19c04b nspawn: give better message when invoked as non-root without arguments
When invoked as non-root, we would suggest re-running as root without any
further hint. But this immediately spawns a machine from the local directory,
which can be rather surprising. So let's give a better hint.

(In general, I don't think commandline programs should do "significant" things
when invoked without any arguments. In this regard it would be better if
systemd-nspawn would not spawn a machine from the current directory if called
with no arguments and at least "-D ." would be required.)
2020-09-24 16:36:51 +02:00
Lennart Poettering 511a8cfe30 mount-util: switch most mount_verbose() code over to not follow symlinks 2020-09-23 18:57:36 +02:00
Lennart Poettering 065b47749d tree-wide: use ERRNO_IS_PRIVILEGE() whereever appropriate 2020-09-22 16:25:22 +02:00
Lennart Poettering aee36b4ea2 dissect-image: process /usr/ GPT partition type 2020-09-19 21:19:51 +02:00
Lennart Poettering 10e8a60baa nspawn: add --console=autopipe mode
By default we'll run a container in --console=interactive and
--console=read-only mode depending if we are invoked on a tty or not so
that the container always gets a /dev/console allocated, i.e is always
suitable to run a full init system /as those typically expect a
/dev/console to exist).

With the new --console=autopipe mode we do something similar, but
slightly different: when not invoked on a tty we'll use --console=pipe.
This means, if you invoke some tool in a container with this you'll get
full inetractivity if you invoke it on a tty but things will also be
very nicely pipeable. OTOH you cannot invoke a full init system like
this, because you might or might not become a /dev/console this way...

Prompted-by: #17070

(I named this "autopipe" rather than "auto" or so, since the default
mode probably should be named "auto" one day if we add a name for it,
and this is so similar to "auto" except that it uses pipes in the
non-tty case).
2020-09-17 16:39:27 +02:00
Lennart Poettering 335d2eadca nspawn: don't become TTY controller just to undo it later again
Instead of first becoming a controlling process of the payload pty
as side effect of opening it (without O_NOCTTY), and then possibly
dropping it again, let's do it cleanly an reverse the logic: let's open
the pty without becoming its controller first. Only after everything
went the way we wanted it to go become the controller explicitly.

This has the benefit that the PID 1 stub process we run (as effect of
--as-pid2) doesn't have to lose the tty explicitly, but can just
continue running with things. And we explicitly make the tty controlling
right before invoking actual payload.

In order to make sure everything works as expected validate that the
stub PID 1 in the container really has no conrolling tty by issuing the
TIOCNOTTY tty and expecting ENOTTY, and log about it.

This shouldn't change behaviour much, it just makes thins a bit cleaner,
in particular as we'll not trigger SIGHUP on ourselves (since we are
controller and session leader) due to TIOCNOTTY which we then have to
explicitly ignore.
2020-09-17 16:39:23 +02:00
Lennart Poettering 2fef50cd9e nspawn: fix fd leak on failure path 2020-09-17 16:39:19 +02:00
Lennart Poettering 554c4beb47 nspawn: print log notice when we are invoked from a tty but in "pipe" mode
If people do this then things are weird, and they should probably use
--console=interactive (i.e. the default) instead.

Prompted-by: #17070
2020-09-17 16:39:16 +02:00
Lennart Poettering 89e62e0bd3 dissect: wrap verity settings in new VeritySettings structure
Just some refactoring: let's place the various verity related parameters
in a common structure, and pass that around instead of the individual
parameters.

Also, let's load the PKCS#7 signature data when finding metadata
right-away, instead of delaying this until we need it. In all cases we
call this there's not much time difference between the metdata finding
and the loading, hence this simplifies things and makes sure root hash
data and its signature is now always acquired together.
2020-09-17 20:36:23 +09:00
Lennart Poettering 3652872add nspawn: add --set-credential= and --load-credential=
Let's allow passing in creds to containers, so that PID 1 inside the
container can pick them up.
2020-08-25 19:45:47 +02:00
Lennart Poettering 0f48ba7b84 nspawn: provide $container and $container_uuid in /run/host too
This has the major benefit that the entire payload of the container can
access these files there. Previously, we'd set them only as env vars,
but that meant only PID 1 could read them directly or other privileged
payload code with access to /run/1/environ.
2020-08-20 10:17:55 +02:00
Lennart Poettering 9fac502920 nspawn,pid1: pass "inaccessible" nodes from cntr mgr to pid1 payload via /run/host
Let's make /run/host the sole place we pass stuff from host to container
in and place the "inaccessible" nodes in /run/host too.

In contrast to the previous two commits this is a minor compat break, but
not a relevant one I think. Previously the container manager would place
these nodes in /run/systemd/inaccessible/ and that's where PID 1 in the
container would try to add them too when missing. Container manager and
PID 1 in the container would thus manage the same dir together.

With this change the container manager now passes an immutable directory
to the container and leaves /run/systemd entirely untouched, and managed
exclusively by PID 1 inside the container, which is nice to have clear
separation on who manages what.

In order to make sure systemd then usses the /run/host/inaccesible/
nodes this commit changes PID 1 to look for that dir and if it exists
will symlink it to /run/systemd/inaccessible.

Now, this will work fine if new nspawn and new pid 1 in the container
work together. as then the symlink is created and the difference between
the two dirs won't matter.

For the case where an old nspawn invokes a new PID 1: in this case
things work as they always worked: the dir is managed together.

For the case where different container manager invokes a new PID 1: in
this case the nodes aren't typically passed in, and PID 1 in the
container will try to create them and will likely fail partially (though
gracefully) when trying to create char/block device nodes. THis is fine
though as there are fallbacks in place for that case.

For the case where a new nspawn invokes an old PID1: this is were the
(minor) incompatibily happens: in this case new nspawn will place the
nodes in the /run/host/inaccessible/ subdir, but the PID 1 in the
container won't look for them there. Since the nodes are also not
pre-created in /run/systed/inaccessible/ PID 1 will try to create them
there as if a different container manager sets them up. This is of
course not sexy, but is not a total loss, since as mentioned fallbacks
are in place anyway. Hence I think it's OK to accept this minor
incompatibility.
2020-08-20 10:17:52 +02:00
Lennart Poettering e96ceabac9 nspawn: move $NOTIFY_SOCKET into /run/host/ too
The sd_notify() socket that nspawn binds that the payload can use to
talk to it was previously stored in /run/systemd/nspawn/notify, which is
weird (as in the previous commit) since this makes /run/systemd
something that is cooperatively maintained by systemd inside the
container and nspawn outside of it.

We now have a better place where container managers can put the stuff
they want to pass to the payload: /run/host/, hence let's make use of
that.

This is not a compat breakage, since the sd_notify() protocol is based
on the $NOTIFY_SOCKET env var, where we place the new socket path.
2020-08-20 10:17:48 +02:00
Lennart Poettering 5a27b39518 nspawn/machine: move mount propagation dir to /run/host/incoming
Previously we'd use a directory /run/systemd/nspawn/incoming for
accepting mounts to propagate from the host. This is a bit weird, since
we have a shared namespace: /run/systemd/ contains both stuff managed by
the surround nspawn as well as from the systemd inside.

We now have the /run/host/ hierarchy that has special stuff we want to
pass from host to container. Let's make use of that here, and move this
directory here too.

This is not a compat breakage, since the payload never interfaces with
that directory natively: it's only nspawn and machined that need to
agree on it.
2020-08-20 10:17:25 +02:00
Lennart Poettering af187ab237 dissect: introduce new helper dissected_image_mount_and_warn() and use it everywhere 2020-08-11 22:26:48 +02:00
Zbigniew Jędrzejewski-Szmek 7e62257219
Merge pull request #16308 from bluca/root_image_options
service: add new RootImageOptions feature
2020-08-03 10:04:36 +02:00
Daan De Meyer 6f646e0175 nspawn: Fix incorrect usage of putenv
strv_env_get only returns the environment variable value. putenv expects
KEY=VALUE format strings. Use setenv instead to fix the use.
2020-08-03 09:58:05 +02:00
Luca Boccassi 18d7370587 service: add new RootImageOptions feature
Allows to specify mount options for RootImage.
In case of multi-partition images, the partition number can be prefixed
followed by colon. Eg:

RootImageOptions=1:ro,dev 2:nosuid nodev

In absence of a partition number, 0 is assumed.
2020-07-29 17:17:32 +01:00
Lennart Poettering 2a2e78e969 nspawn: fix MS_SHARED mount propagation for userns containers
We want our OS trees to be MS_SHARED by default, so that our service
namespacing logic can work correctly. Thus in nspawn we mount everything
MS_SHARED when organizing our tree. We do this early on, before changing
the user namespace (if that's requested). However CLONE_NEWUSER actually
resets MS_SHARED to MS_SLAVE for all mounts (so that less privileged
environments can't affect the more privileged ones). Hence, when
invoking it we have to reset things to MS_SHARED afterwards again. This
won't reestablish propagation, but it will make sure we get a new set of
mount peer groups everywhere that then are honoured for the mount
namespaces/propagated mounts set up inside the container further down.
2020-07-23 17:08:39 +02:00
Luca Boccassi ed4512d009 nspawn: set container_host env vars before user arguments
Allows users on the command line to seamlessly override
$container_host_* just like they can override $container_id and
$container
2020-07-20 07:28:22 +02:00
Lennart Poettering 38ccb55731 nss-mymachines: drop support for UID/GID resolving
Now that we make the user/group name resolving available via userdb and
thus nss-systemd, we do not need the UID/GID resolving support in
nss-mymachines anymore. Let's drop it hence.

We keep the module around, since besides UID/GID resolving it also does
hostname resolving, which we care about. (One of those days we should
replace that by some Varlink logic between
nss-resolve/systemd-resolved.service too)

The hooks are kept in the NSS module, but they do not resolve anything
anymore, in order to keep compat at a maximum.
2020-07-14 17:08:12 +02:00
Zbigniew Jędrzejewski-Szmek 55aacd502b
Merge pull request #15891 from bluca/host_os_release
Container Interface: expose the host's os-release metadata to nspawn and portable guests
2020-07-08 23:52:13 +02:00
Luca Boccassi c2923fdcd7 dissect/nspawn: add support for dm-verity root hash signature
Since cryptsetup 2.3.0 a new API to verify dm-verity volumes by a
pkcs7 signature, with the public key in the kernel keyring,
is available. Use it if libcryptsetup supports it.
2020-06-25 08:45:21 +01:00
Lennart Poettering 6b000af4f2 tree-wide: avoid some loaded terms
https://tools.ietf.org/html/draft-knodel-terminology-02
https://lwn.net/Articles/823224/

This gets rid of most but not occasions of these loaded terms:

1. scsi_id and friends are something that is supposed to be removed from
   our tree (see #7594)

2. The test suite defines an API used by the ubuntu CI. We can remove
   this too later, but this needs to be done in sync with the ubuntu CI.

3. In some cases the terms are part of APIs we call or where we expose
   concepts the kernel names the way it names them. (In particular all
   remaining uses of the word "slave" in our codebase are like this,
   it's used by the POSIX PTY layer, by the network subsystem, the mount
   API and the block device subsystem). Getting rid of the term in these
   contexts would mean doing some major fixes of the kernel ABI first.

Regarding the replacements: when whitelist/blacklist is used as noun we
replace with with allow list/deny list, and when used as verb with
allow-list/deny-list.
2020-06-25 09:00:19 +02:00
Luca Boccassi e1bb4b0d1d nspawn: implement container host os-release interface 2020-06-23 12:58:21 +01:00
Luca Boccassi 0389f4fa81 core: add RootHash and RootVerity service parameters
Allow to explicitly pass root hash (explicitly or as a file) and verity
device/file as unit options. Take precedence over implicit checks.
2020-06-23 10:50:09 +02:00
Zbigniew Jędrzejewski-Szmek 9664be199a
Merge pull request #16118 from poettering/inaccessible-fixlets
move $XDG_RUNTIME_DIR/inaccessible/ to $XDG_RUNTIME_DIR/systemd/inaccessible
2020-06-10 10:23:13 +02:00
Lennart Poettering 48b747fa03 inaccessible: move inaccessible file nodes to /systemd/ subdir in runtime dir always
Let's make sure $XDG_RUNTIME_DIR for the user instance and /run for the
system instance is always organized the same way: the "inaccessible"
device nodes should be placed in a subdir of either called "systemd" and
a subdir of that called "inaccessible".

This way we can emphasize the common behaviour, and only differ where
really necessary.

Follow-up for #13823
2020-06-09 16:23:56 +02:00
Luca Boccassi e7cbe5cb9e dissect: support single-filesystem verity images with external verity hash
dm-verity support in dissect-image at the moment is restricted to GPT
volumes.
If the image a single-filesystem type without a partition table (eg: squashfs)
and a roothash/verity file are passed, set the verity flag and mark as
read-only.
2020-06-09 12:19:21 +01:00
Lennart Poettering fb29cdbef2 tree-wide: make sure our control buffers are properly aligned
We always need to make them unions with a "struct cmsghdr" in them, so
that things properly aligned. Otherwise we might end up at an unaligned
address and the counting goes all wrong, possibly making the kernel
refuse our buffers.

Also, let's make sure we initialize the control buffers to zero when
sending, but leave them uninitialized when reading.

Both the alignment and the initialization thing is mentioned in the
cmsg(3) man page.
2020-05-07 14:39:44 +02:00
Motiejus Jakštys 5c4deb9a5c nspawn: mount custom paths before writing to /etc
Consider such configuration:

    $ systemd-nspawn --read-only --timezone=copy --resolv-conf=copy-host \
        --overlay="+/etc::/etc" <...>

Assuming one wants `/` to be read-only, DNS and `/etc/localtime` to
work. One way to do it is to create an overlay filesystem in `/etc/`.
However, systemd-nspawn tries to create `/etc/resolv.conf` and
`/etc/localtime` before mounting the custom paths, while `/` (and, by
extension, `/etc`) is read-only. Thus it fails to create those files.

Mounting custom paths before modifying anything in `/etc/` makes this
possible.

Full example:

```
$ debootstrap buster /var/lib/machines/t1 http://deb.debian.org/debian
$ systemd-nspawn --private-users=false --timezone=copy --resolv-conf=copy-host --read-only --tmpfs=/var --tmpfs=/run --overlay="+/etc::/etc" -D /var/lib/machines/t1 ping -c 1 example.com
Spawning container t1 on /var/lib/machines/t1.
Press ^] three times within 1s to kill container.
ping: example.com: Temporary failure in name resolution
Container t1 failed with error code 130.
```

With the patch:

```
$ sudo ./build/systemd-nspawn --private-users=false --timezone=copy --resolv-conf=copy-host --read-only --tmpfs=/var --tmpfs=/run --overlay="+/etc::/etc" -D /var/lib/machines/t1 ping -qc 1 example.com
Spawning container t1 on /var/lib/machines/t1.
Press ^] three times within 1s to kill container.
PING example.com (93.184.216.34) 56(84) bytes of data.

--- example.org ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 110.912/110.912/110.912/0.000 ms
Container t1 exited successfully.
```
2020-05-05 09:02:57 +02:00
Lennart Poettering 371d72e05b socket-util: introduce type-safe, dereferencing wrapper CMSG_FIND_DATA around cmsg_find()
let's take this once step further, and add type-safety to cmsg_find(),
and imply the CMSG_DATA() macro for finding the cmsg payload.
2020-04-23 19:41:15 +02:00
Lennart Poettering 0f4a141744
Merge pull request #15504 from poettering/cmsg-find-pure
just the recvmsg_safe() stuff from #15457
2020-04-23 17:28:19 +02:00
Lennart Poettering 3691bcf3c5 tree-wide: use recvmsg_safe() at various places
Let's be extra careful whenever we return from recvmsg() and see
MSG_CTRUNC set. This generally means we ran into a programming error, as
we didn't size the control buffer large enough. It's an error condition
we should at least log about, or propagate up. Hence do that.

This is particularly important when receiving fds, since for those the
control data can be of any size. In particular on stream sockets that's
nasty, because if we miss an fd because of control data truncation we
cannot recover, we might not even realize that we are one off.

(Also, when failing early, if there's any chance the socket might be
AF_UNIX let's close all received fds, all the time. We got this right
most of the time, but there were a few cases missing. God, UNIX is hard
to use)
2020-04-23 09:41:47 +02:00
Lennart Poettering 287b737693 nspawn: refuse politely when we are run in the non-host netns in combination with --image=
Strictly speaking this doesn't really fix #15079, but it at least means
we won't hang anymore.

Fixes: #15079
2020-04-23 09:18:43 +02:00
Lennart Poettering 1433e0f212 nspawn: minor simplification 2020-04-23 09:18:05 +02:00
Lennart Poettering 86775e3524 nspawn: beef up --resolve-conf= modes
Let's add flavours for copying stub/uplink resolv.conf versions.

Let's add a more brutal "replace" mode, where we'll replace any existing
destination file.

Let's also change what "auto" means: instead of copying the static file,
let's use the stub file, so that DNS search info is copied over.

Fixes: #15340
2020-04-22 19:38:04 +02:00
Zbigniew Jędrzejewski-Szmek 162392b75a tree-wide: spellcheck using codespell
Fixes #15436.
2020-04-16 18:00:40 +02:00
Yu Watanabe df883de98a pid1, nspawn: voidify loopback_setup() 2020-03-04 14:18:55 +01:00
Zbigniew Jędrzejewski-Szmek 105a1a36cd tree-wide: fix spelling of lookup and setup verbs
"set up" and "look up" are the verbs, "setup" and "lookup" are the nouns.
2020-03-03 15:02:53 +01:00
Lennart Poettering 4fcb96ce25 nspawn: fsck all images when mounting things
Also, start logging about mount errors, things are hard to debug
otherwise.
2020-01-29 19:29:55 +01:00