Currently, mount_sysfs() only creates /sys/fs/cgroup if cg_ns_supported().
The comment explains that we need to "Create mountpoint for
cgroups. Otherwise we are not allowed since we remount /sys read-only.";
that is: that we need to do it now, rather than later. However, the
comment doesn't do anything to explain why we only need to do this if
cg_ns_supported(); shouldn't we _always_ need to do it?
The answer is that if !use_cgns, then this was already done by the outer
child, so mount_sysfs() only needs to do it if use_cgns. Now,
mount_sysfs() doesn't know whether use_cgns, but !cg_ns_supported() implies
!use_cgns, so we can optimize" the case where we _know_ !use_cgns, and deal
with a no-op mkdir_p() in the false-positive where cgns_supported() but
!use_cgns.
But is it really much of an optimization? We're potentially spending an
access(2) (cg_ns_supported() could be cached from a previous call) to
potentially save an lstat(2) and mkdir(2); and all of them are on virtual
fileystems, so they should all be pretty cheap.
So, simplify and drop the conditional. It's a dubious optimization that
requires more text to explain than it's worth.
One of the things that tmpfs_patch_options does is take an (optional) UID,
and insert "uid=${UID},gid=${UID}" into the options string. So we need a
uid_t argument, and a way of telling if we should use it. Fortunately,
that is built in to the uid_t type by having UID_INVALID as a possible
value.
So this is really a feature that requires one argument. Yet, it is somehow
taking 4! That is absurd. Simplify it to only take one argument, and have
that trickle all the way up to mount_all()'s usage.
Now, in may of the uses, the argument becomes
uid_shift == 0 ? UID_INVALID : uid_shift
because it used to treat uid_shift=0 as invalid unless the patch_ids flag
was also set. This keeps the behavior the same. Note that in all cases
where it is invoked, if !use_userns (sometimes called !userns), then
uid_shift is 0; we don't have to add any checks for that.
That said, I'm pretty sure that "uid=0" and not setting "uid=" are the
same, but Christian Brauner seemed to not think so when implementing the
cgns support. https://github.com/systemd/systemd/pull/3589
One of the things that mkdir_userns{,_p}() does is take an (optional) UID,
and chown the directory to that. So we need a uid_t argument, and a way of
telling if we should use that uid_t argument. Fortunately, that is built
in to the uid_t type by having UID_INVALID as a possible value.
However, currently mkdir_userns() also takes a MountSettingsMask and checks
a couple of bits in it to decide if it should perform the chown.
Drop the mask argument, and instead have the caller pass UID_INVALID if it
shouldn't chown.
nspawn as it is now is a generally useful tool, hence let's drop the
comments about it being useful for debug and so on only.
The new wording just makes the first sentence of the main page also the
summary.
These lines are generally out-of-date, incomplete and unnecessary. With
SPDX and git repository much more accurate and fine grained information
about licensing and authorship is available, hence let's drop the
per-file copyright notice. Of course, removing copyright lines of others
is problematic, hence this commit only removes my own lines and leaves
all others untouched. It might be nicer if sooner or later those could
go away too, making git the only and accurate source of authorship
information.
This part of the copyright blurb stems from the GPL use recommendations:
https://www.gnu.org/licenses/gpl-howto.en.html
The concept appears to originate in times where version control was per
file, instead of per tree, and was a way to glue the files together.
Ultimately, we nowadays don't live in that world anymore, and this
information is entirely useless anyway, as people are very welcome to
copy these files into any projects they like, and they shouldn't have to
change bits that are part of our copyright header for that.
hence, let's just get rid of this old cruft, and shorten our codebase a
bit.
Let's always write "1 << 0", "1 << 1" and so on, except where we need
more than 31 flag bits, where we write "UINT64(1) << 0", and so on to force
64bit values.
This is mostly fall-out from d1a1f0aaf0,
however some cases are older bugs.
There might be more issues lurking, this was a simple grep for "%m"
across the tree, with all lines removed that mention "errno" at all.
This makes it easier to link the nspawn implementation to the tests.
Right now this just means that nspawn-patch-uid.c is not compiled
twice, which is nice, but results in test-patch-uid being slightly bigger,
which is not nice. But in general, we should use convenience libs to
compile everything just once, as far as possible. Otherwise, once we
start compiling a few files here twice, and a few file there thrice, we
soon end up in a state where we are doing hundreds of extra compilations.
So let's do the "right" thing, even if is might not be more efficient.
Let's rework error handling a bit in image_find() and friends: when we
can't find an image, return -ENOENT rather than 0. That's better as
before we violated the usual rule in our codebase that return parameters
are initialized when the return value is >= 0 and otherwise not touched.
This also makes enumeration and validation a bit more strict: we'll only
accept ".raw" as suffix for regular files, and filter out this suffix
handling on directories/subvolumes, where it makes no sense.
This distuingishes two different classes of images, one for the purpose
of npsawn-like containers, i.e. "machines", and one for portable
services.
This distinction is mostly about search paths. We look for machine
images in /var/lib/machines and for portable images in
/var/lib/portables.
We already do this kind of validation in nspawn when we operate on a
plain directory, let's also do this on raw images under the same
condition: that we are about too boot the image. Also, do this when we
are about to read OS metadata from it.
If we log to the pty that is configured as stdin/stdout/stderr of the
container too early we risk filling it up in full before we start
processing the pty from the parent process, resulting in deadlocks.
Let's hence keep a copy of the original tty we were started on before
setting up stdin/stdout/stderr, so that we can log to it, and keep using
it as long as we can.
Since the kernel's pty internal buffer is pretty small this actually
triggered deadlocks when we debug logged at lot from nspawn's child
processes, see: https://github.com/systemd/systemd/pull/9024#issuecomment-390403674
With this change we won't use the pty at all, only the actual payload we
start will, and hence we won't deadlock on it, ever.
Let's make use of log_set_open_when_needed() in nspawn too, i.e. at the
point where we close logging because we are about to rearrange fds,
let's automatically reopen the logging fds when we need them, the same
way as we do that in the service manager. This makes things simpler and
more robust.
This makes most header files easier to look at. Also Emacs gets really
slow when browsing through large sections of overly long prototypes,
which is much improved by this macro.
We should probably not do something similar with too many other cases,
as macros like this might help readability for some, but make it worse
for others. But I think given the complexity of this specific prototype
and how often we use it, it's worth doing.
The use of UINT64_C() in the SettingsMask enum definition is misleading:
it does not mean that individual fields have this width. E.g., with
enum {
FOO = UINT64_C(1)
}
sizeof(FOO) gives 4. It only means that the shift is done properly. So
1 << 35 is undefined, but UINT64_C(1) << 35 is the expected 64 bit
constant. Thus, the use UINT64_C() is useful, because we know that the shifts
are done properly, no matter what the value of _RLIMIT_MAX is, but when those
fields are used in expressions, we don't know what size they will be
(probably 4). Let's add a define which "hides" the enum definition behind a
define which gives the same value but is actually 64 bit. I think this is a
nicer solution than requiring all users to cast SETTING_RLIMIT_FIRST before
use.
Fixes#9035.
Let's separate the loading of the settings object and the merging into
our arg_xyz fields into two.
This will become particularly useful when we eventually are able to load
settings from OCI runtime files in addition to .nspawn files.
Similar as the other options added before, this is primarily useful to
provide comprehensive OCI runtime compatbility, but might be useful
otherwise, too.
This simply controls the PR_SET_NO_NEW_PRIVS flag for the container.
This too is primarily relevant to provide OCI runtime compaitiblity, but
might have other uses too, in particular as it nicely complements the
existing --capability= and --drop-capability= flags.
Previously, the container's hostname was exclusively initialized from
the machine name configured with --machine=, i.e. the internal name and
the external name used for and by the container was synchronized. This
adds a new option --hostname= that optionally allows the internal name
to deviate from the external name.
This new option is mainly useful to ultimately implement the OCI runtime
spec directly in nspawn, but it might be useful on its own for some
other usecases too.