Let's rework the D-Bus APIs GetImageOSRelease() to use the new internal
metadata API, to query what it needs to know. Augment it with
GetImageHostname(), GetImageMachineID(), GetImageMachineInfo(), that
expose the other new APIS.
This reverts commit ee043777be.
It broke almost everywhere it touched. The places that
handn't been converted, were mostly followed by special
handling for the invalid PID `0`. That explains why they
tested for `pid < 0` instead of `pid <= 0`.
I think that one was the first commit I reviewed, heh.
UID/GID mapping with userns can be arbitrarily complex. Let's break this
down to a single admin-friendly parameter: let's expose the UID/GID
shift of a container via a new bus call for each container, and let's
show this as part of "machinectl status" if it is not 0.
This should work for pretty much all real-life full OS container setups
(i.e. the stuff machined is suppose to be useful for). For everything
else we generate a clean error, clarifying that we can't expose the
mapping.
This adds a bus call GetImageOSRelease() to the Manager interface that
retrieves the /etc/os-release file of a machine image. It matches the existing
GetMachineOSRelease() call, however operates on a disk image rather than a
running container.
The backend for this call on .raw images is implemented via the generalized
image dissector, which makes this scheme relatively easy to implement.
We don't have plural in the name of any other -util files and this
inconsistency trips me up every time I try to type this file name
from memory. "formats-util" is even hard to pronounce.
Beef up the existing var_tmp() call, rename it to var_tmp_dir() and add a
matching tmp_dir() call (the former looks for the place for /var/tmp, the
latter for /tmp).
Both calls check $TMPDIR, $TEMP, $TMP, following the algorithm Python3 uses.
All dirs are validated before use. secure_getenv() is used in order to limite
exposure in suid binaries.
This also ports a couple of users over to these new APIs.
The var_tmp() return parameter is changed from an allocated buffer the caller
will own to a const string either pointing into environ[], or into a static
const buffer. Given that environ[] is mostly considered constant (and this is
exposed in the very well-known getenv() call), this should be OK behaviour and
allows us to avoid memory allocations in most cases.
Note that $TMPDIR and friends override both /var/tmp and /tmp usage if set.
This is a follow-up to 5d2036b5f3, and also makes
the "machinectl clean" verb asynchronous, after all it's little more than a
series of image removals.
The changes required to make this happen are a bit more comprehensive as we
need to pass information about deleted images back to the client, as well as
information about the image we failed on if we failed on one. Hence, create a
temporary file in /tmp, serialize that data into, and read it from the parent
after the operation is complete.
This new call returns a file descriptor for the root directory of a container.
This file descriptor may then be used to access the rest of the container's
file system, via openat() and similar calls. Since the file descriptor returned
is for the file system namespace inside of the container it may be used to
access all files of the container exactly the way the container itself would
see them. This is particularly useful for containers run directly from
loopback media, for example via systemd-nspawn's --image= switch. It also
provides access to directories such as /run of a container that are normally
not accessible to the outside of a container.
This replaces PR #2870.
Fixes: #2870
An unlimited quota makes a lot of sense, but we really should try to propagate this onto the loopback file size, since
an infinitely sized file makes no sense.
Fixes: #2314#2253
Issue #2388 suggests the current TasksMax= setting for user processes is to low. Bump it to 12K. Also, bump the
container TasksMax= from 8K to 16K, so that it remains higher than the one for user processes.
(Compare: the kernel default limit for processes system-wide is 32K).
Fixes#2388
GLIB has recently started to officially support the gcc cleanup
attribute in its public API, hence let's do the same for our APIs.
With this patch we'll define an xyz_unrefp() call for each public
xyz_unref() call, to make it easy to use inside a
__attribute__((cleanup())) expression. Then, all code is ported over to
make use of this.
The new calls are also documented in the man pages, with examples how to
use them (well, I only added docs where the _unref() call itself already
had docs, and the examples, only cover sd_bus_unrefp() and
sd_event_unrefp()).
This also renames sd_lldp_free() to sd_lldp_unref(), since that's how we
tend to call our destructors these days.
Note that this defines no public macro that wraps gcc's attribute and
makes it easier to use. While I think it's our duty in the library to
make our stuff easy to use, I figure it's not our duty to make gcc's own
features easy to use on its own. Most likely, client code which wants to
make use of this should define its own:
#define _cleanup_(function) __attribute__((cleanup(function)))
Or similar, to make the gcc feature easier to use.
Making this logic public has the benefit that we can remove three header
files whose only purpose was to define these functions internally.
See #2008.
With this change we understand more than just leaf quota groups for
btrfs file systems. Specifically:
- When we create a subvolume we can now optionally add the new subvolume
to all qgroups its parent subvolume was member of too. Alternatively
it is also possible to insert an intermediary quota group between the
parent's qgroups and the subvolume's leaf qgroup, which is useful for
a concept of "subtree" qgroups, that contain a subvolume and all its
children.
- The remove logic for subvolumes has been updated to optionally remove
any leaf qgroups or "subtree" qgroups, following the logic above.
- The snapshot logic for subvolumes has been updated to replicate the
original qgroup setup of the source, if it follows the "subtree"
design described above. It will not cover qgroup setups that introduce
arbitrary qgroups, especially those orthogonal to the subvolume
hierarchy.
This also tries to be more graceful when setting up /var/lib/machines as
btrfs. For example, if mkfs.btrfs is missing we don't even try to set it
up as loopback device.
Fixes#1559Fixes#1129
Extra details for an action can be supplied when calling polkit's
CheckAuthorization method. Details are a list of key/value string pairs.
Custom policy can use these details when making authorization decisions.
Some of the operations machined/machinectl implement are also very
useful when applied to the host system (such as machinectl login,
machinectl shell or machinectl status), hence introduce a pseudo-machine
by the name of ".host" in machined that refers to the host system, and
may be used top execute operations on the host system with.
This copies the pseudo-image ".host" machined already implements for
image related commands.
(This commit also adds a PK privilege for opening a PTY in a container,
which was previously not accessible for non-root.)
As it turns out machine_name_is_valid() does the exact same thing as
hostname_is_valid() these days, as it just invoked that and checked the
name length was < 64. However, hostname_is_valid() checks the length
against HOST_NAME_MAX anyway (which is 64 on Linux), hence any
additional check is redundant.
We hence replace machine_name_is_valid() by a macro that simply maps it
to hostname_is_valid() but sets the allow_trailing_dot parameter to
false. We also move this this call to hostname-util.h, to the same place
as the hostname_is_valid() declaration.
When looking for the machine belonging to a PID, always look for the
leader first, only then fall back to a cgroup check. We keep direct
track of the leader PID, but only indirectly of the cgroup, hence prefer
the PID.
This new bus call opens an interactive shell in a container. It works
like the existing OpenLogin() call, but does not involve getty, and
instead opens an arbitrary command line.
This is similar to "systemd-run -t -M" but is controlled by a specific
PolicyKit privilege.
This splits up the stopping logic for machines into two steps: first on
machine_stop() we begin with the shutdown of a machine by queuing the
stop method call for it. Then, in machine_finalize() we actually remove
the rest of its runtime context. This mimics closely how sessions are
handled in logind.
This also reworks the GC logic to strictly check the current state of
the machine unit, rather than shortcutting a few cases, like for example
assuming that UnitRemoved really means a machine is gone (which it isn't
since Reloading might trigger it, see #376).
Fixes#376.
Use mfree() where we can.
Drop unnecessary {}.
Drop unnecessary variable declarations.
Cast syscall invocations where explicitly don't care for the return
value to (void).
Reword a comment.
If we get a weird signal, then we should log about it, but not return an
error, since sd-bus will not call us again then anymore, but for these
signals we match here we actually do want to be called on the next
invocation.
Given a container "foo", that maps user id $UID to container user, using
user namespaces, this NSS module extenstion will now map the $UID to a
name "vu-foo-$TUID" for the translated UID $UID.
Similar, userns groups are mapped to "vg-foo-$TGID" for translated GIDs
of $GID.
This simple change should make userns users more discoverable. Also,
given that many tools like "adduser" check NSS before allocating a UID,
should lower the chance of UID range conflicts between tools.
If NULL is specified for the bus it is now automatically derived from
the passed in message.
This commit also changes a number of invocations of sd_bus_send() to
make use of this.
This should simplify the prototype a bit. The bus parameter is redundant
in most cases, and in the few where it matters it can be derived from
the message via sd_bus_message_get_bus().
If a unit is stopped for a moment, we need to invalidate our knowledge
of it, otherwise we might be confused by automatic restarts
This makes reboots for nspawn containers run as service work correctly.
https://bugs.freedesktop.org/show_bug.cgi?id=87428
If /var/lib/machines is mounted as btrfs loopback file system in
/var/lib/machines.raw with this change we automatically grow the file
system as it fills up. After each 10M we write to it during imports, we
check the free disk space, and if the fill level grows beyond 66% we
increase the size of the file system to 3x the fill level (thus lowering
it to 33%).
When the pool size limit is altered with "machinectl set-limit", then
not only set the subvolume quota of the /var/lib/machine subvolume, but
also resize the backing loop file and the btrfs file system on it
dynamically.