Limit size of various tmpfs mounts to 10% of RAM, except volatile root and /var
to 25%. Another exception is made for /dev (also /devs for PrivateDevices) and
/sys/fs/cgroup since no (or very few) regular files are expected to be used.
In addition, since directories, symbolic links, device specials and xattrs are
not counted towards the size= limit, number of inodes is also limited
correspondingly: 4MB size translates to 1k of inodes (assuming 4k each), 10% of
RAM (using 16GB of RAM as baseline) translates to 400k and 25% to 1M inodes.
Because nr_inodes option can't use ratios like size option, there's an
unfortunate side effect that with small memory systems the limit may be on the
too large side. Also, on an extremely small device with only 256MB of RAM, 10%
of RAM for /run may not be enough for re-exec of PID1 because 16MB of free
space is required.
To support ProtectHome=y in a user namespace (which mounts the inaccessible
nodes), the nodes need to be accessible by the user. Create these paths and
devices in the user runtime directory so they can be used later if needed.
This makes --overlay=+/foobar::/foobar work again, i.e. where the middle
parameter is left out. According to the documentation this is supposed
to generate a temporary writable work place in the midle. But it
apparently never did. Weird.
chase_symlinks() would return negative on error, and either a non-negative status
or a non-negative fd when CHASE_OPEN was given. This made the interface quite
complicated, because dependning on the flags used, we would get two different
"types" of return object. Coverity was always confused by this, and flagged
every use of chase_symlinks() without CHASE_OPEN as a resource leak (because it
would this that an fd is returned). This patch uses a saparate output parameter,
so there is no confusion.
(I think it is OK to have functions which return either an error or an fd. It's
only returning *either* an fd or a non-fd that is confusing.)
This reverts commit 72d967df3e.
Revert this because it broke the `norbind` option of the bind flags
because it does bind-mounts unconditionally recursive.
Let's bring the old logic back.
Fixes: #13170
prefix_root() is equivalent to path_join() in almost all ways, hence
let's remove it.
There are subtle differences though: prefix_root() will try shorten
multiple "/" before and after the prefix. path_join() doesn't do that.
This means prefix_root() might return a string shorter than both its
inputs combined, while path_join() never does that. I like the
path_join() semantics better, hence I think dropping prefix_root() is
totally OK. In the end the strings generated by both functon should
always be identical in terms of path_equal() if not streq().
This leaves prefix_roota() in place. Ideally we'd have path_joina(), but
I don't think we can reasonably implement that as a macro. or maybe we
can? (if so, sounds like something for a later PR)
Also add in a few missing OOM checks
The host mounts it like that, nspawn hence should do too.
Moreover, mount the file system after doing CLONEW_NEWIPC so that it
actually reflects the right mqueues. Finally, mount it wthout
considering it fatal, since POSIX mqueue support is little used and it
should be fine not to support it in the kernel.
The function is otherwise generic enough to toggle other bind mount
flags beyond MS_RDONLY (for example: MS_NOSUID or MS_NODEV), hence let's
beef it up slightly to support that too.
This is a pretty large patch, and adds support for OCI runtime bundles
to nspawn. A new switch --oci-bundle= is added that takes a path to an
OCI bundle. The JSON file included therein is read similar to a .nspawn
settings files, however with a different feature set.
Implementation-wise this mostly extends the pre-existing Settings object
to carry additional properties for OCI. However, OCI supports some
concepts .nspawn files did not support yet, which this patch also adds:
1. Support for "masking" files and directories. This functionatly is now
also available via the new --inaccesible= cmdline command, and
Inaccessible= in .nspawn files.
2. Support for mounting arbitrary file systems. (not exposed through
nspawn cmdline nor .nspawn files, because probably not a good idea)
3. Ability to configure the console settings for a container. This
functionality is now also available on the nspawn cmdline in the new
--console= switch (not added to .nspawn for now, as it is something
specific to the invocation really, not a property of the container)
4. Console width/height configuration. Not exposed through
.nspawn/cmdline, but this may be controlled through $COLUMNS and
$LINES like in most other UNIX tools.
5. UID/GID configuration by raw numbers. (not exposed in .nspawn and on
the cmdline, since containers likely have different user tables, and
the existing --user= switch appears to be the better option)
6. OCI hook commands (no exposed in .nspawn/cmdline, as very specific to
OCI)
7. Creation of additional devices nodes in /dev. Most likely not a good
idea, hence not exposed in .nspawn/cmdline. There's already --bind=
to achieve the same, which is the better alternative.
8. Explicit syscall filters. This is not a good idea, due to the skewed
arch support, hence not exposed through .nspawn/cmdline.
9. Configuration of some sysctls on a whitelist. Questionnable, not
supported in .nspawn/cmdline for now.
10. Configuration of all 5 types of capabilities. Not a useful concept,
since the kernel will reduce the caps on execve() anyway. Not
exposed through .nspawn/cmdline as this is not very useful hence.
Note that this only implements the OCI runtime logic itself. It does not
provide a runc-compatible command line tool. This is left for a later
PR. Only with that in place tools such as "buildah" can use the OCI
support in nspawn as drop-in replacement.
Currently still missing is OCI hook support, but it's already parsed and
everything, and should be easy to add. Other than that it's OCI is
implemented pretty comprehensively.
There's a list of incompatibilities in the nspawn-oci.c file. In a later
PR I'd like to convert this into proper markdown and add it to the
documentation directory.
This splits out a bunch of functions from fileio.c that have to do with
temporary files. Simply to make the header files a bit shorter, and to
group things more nicely.
No code changes, just some rearranging of source files.
Ideally, coccinelle would strip unnecessary braces too. But I do not see any
option in coccinelle for this, so instead, I edited the patch text using
search&replace to remove the braces. Unfortunately this is not fully automatic,
in particular it didn't deal well with if-else-if-else blocks and ifdefs, so
there is an increased likelikehood be some bugs in such spots.
I also removed part of the patch that coccinelle generated for udev, where we
returns -1 for failure. This should be fixed independently.
When a network namespace is needed, /sys is mounted as tmpfs (see commit
d8fc6a000f for details).
But in this case mode 755 was used as initial permissions for /sys whereas the
default mode for sysfs is 555.
In practice using 755 doesn't have any impact because /sys is mounted read-only
too but for consistency, let's use the correct mode.
Fixes: #10050
Currently, mount_sysfs() only creates /sys/fs/cgroup if cg_ns_supported().
The comment explains that we need to "Create mountpoint for
cgroups. Otherwise we are not allowed since we remount /sys read-only.";
that is: that we need to do it now, rather than later. However, the
comment doesn't do anything to explain why we only need to do this if
cg_ns_supported(); shouldn't we _always_ need to do it?
The answer is that if !use_cgns, then this was already done by the outer
child, so mount_sysfs() only needs to do it if use_cgns. Now,
mount_sysfs() doesn't know whether use_cgns, but !cg_ns_supported() implies
!use_cgns, so we can optimize" the case where we _know_ !use_cgns, and deal
with a no-op mkdir_p() in the false-positive where cgns_supported() but
!use_cgns.
But is it really much of an optimization? We're potentially spending an
access(2) (cg_ns_supported() could be cached from a previous call) to
potentially save an lstat(2) and mkdir(2); and all of them are on virtual
fileystems, so they should all be pretty cheap.
So, simplify and drop the conditional. It's a dubious optimization that
requires more text to explain than it's worth.
One of the things that tmpfs_patch_options does is take an (optional) UID,
and insert "uid=${UID},gid=${UID}" into the options string. So we need a
uid_t argument, and a way of telling if we should use it. Fortunately,
that is built in to the uid_t type by having UID_INVALID as a possible
value.
So this is really a feature that requires one argument. Yet, it is somehow
taking 4! That is absurd. Simplify it to only take one argument, and have
that trickle all the way up to mount_all()'s usage.
Now, in may of the uses, the argument becomes
uid_shift == 0 ? UID_INVALID : uid_shift
because it used to treat uid_shift=0 as invalid unless the patch_ids flag
was also set. This keeps the behavior the same. Note that in all cases
where it is invoked, if !use_userns (sometimes called !userns), then
uid_shift is 0; we don't have to add any checks for that.
That said, I'm pretty sure that "uid=0" and not setting "uid=" are the
same, but Christian Brauner seemed to not think so when implementing the
cgns support. https://github.com/systemd/systemd/pull/3589
One of the things that mkdir_userns{,_p}() does is take an (optional) UID,
and chown the directory to that. So we need a uid_t argument, and a way of
telling if we should use that uid_t argument. Fortunately, that is built
in to the uid_t type by having UID_INVALID as a possible value.
However, currently mkdir_userns() also takes a MountSettingsMask and checks
a couple of bits in it to decide if it should perform the chown.
Drop the mask argument, and instead have the caller pass UID_INVALID if it
shouldn't chown.
These lines are generally out-of-date, incomplete and unnecessary. With
SPDX and git repository much more accurate and fine grained information
about licensing and authorship is available, hence let's drop the
per-file copyright notice. Of course, removing copyright lines of others
is problematic, hence this commit only removes my own lines and leaves
all others untouched. It might be nicer if sooner or later those could
go away too, making git the only and accurate source of authorship
information.
This part of the copyright blurb stems from the GPL use recommendations:
https://www.gnu.org/licenses/gpl-howto.en.html
The concept appears to originate in times where version control was per
file, instead of per tree, and was a way to glue the files together.
Ultimately, we nowadays don't live in that world anymore, and this
information is entirely useless anyway, as people are very welcome to
copy these files into any projects they like, and they shouldn't have to
change bits that are part of our copyright header for that.
hence, let's just get rid of this old cruft, and shorten our codebase a
bit.
This tightens security on /proc: a couple of files exposed there are now
made inaccessible. These files might potentially leak kernel internals
or expose non-virtualized concepts, hence lock them down by default.
Moreover, a couple of dirs in /proc that expose stuff also exposed in
/sys are now marked read-only, similar to how we handle /sys.
The list is taken from what docker/runc based container managers
generally apply, but slightly extended.
Files which are installed as-is (any .service and other unit files, .conf
files, .policy files, etc), are left as is. My assumption is that SPDX
identifiers are not yet that well known, so it's better to retain the
extended header to avoid any doubt.
I also kept any copyright lines. We can probably remove them, but it'd nice to
obtain explicit acks from all involved authors before doing that.
This macro will read a pointer of any type, return it, and set the
pointer to NULL. This is useful as an explicit concept of passing
ownership of a memory area between pointers.
This takes inspiration from Rust:
https://doc.rust-lang.org/std/option/enum.Option.html#method.take
and was suggested by Alan Jenkins (@sourcejedi).
It drops ~160 lines of code from our codebase, which makes me like it.
Also, I think it clarifies passing of ownership, and thus helps
readability a bit (at least for the initiated who know the new macro)