Commit Graph

76 Commits

Author SHA1 Message Date
Lennart Poettering ae2a15bc14 macro: introduce TAKE_PTR() macro
This macro will read a pointer of any type, return it, and set the
pointer to NULL. This is useful as an explicit concept of passing
ownership of a memory area between pointers.

This takes inspiration from Rust:

https://doc.rust-lang.org/std/option/enum.Option.html#method.take

and was suggested by Alan Jenkins (@sourcejedi).

It drops ~160 lines of code from our codebase, which makes me like it.
Also, I think it clarifies passing of ownership, and thus helps
readability a bit (at least for the initiated who know the new macro)
2018-03-22 20:21:42 +01:00
Zbigniew Jędrzejewski-Szmek aa484f3561 tree-wide: use reallocarray instead of our home-grown realloc_multiply (#8279)
There isn't much difference, but in general we prefer to use the standard
functions. glibc provides reallocarray since version 2.26.

I moved explicit_bzero is configure test to the bottom, so that the two stdlib
functions are at the bottom.
2018-02-26 21:20:00 +01:00
Yu Watanabe 72d967df3e nspawn: remove unnecessary mount option parsing logic 2018-02-21 09:06:55 +09:00
Yu Watanabe 30ffb010ff nspawn: fix indentation 2018-02-21 09:05:33 +09:00
Zbigniew Jędrzejewski-Szmek dae8b82eb9 Add mkdir_errno_wrapper() and use instead of mkdir() in various places
We'd pass pointers to mkdir and mkdir_label to call in various places. mkdir
returns the error in errno while mkdir_label returns the error directly.
2017-12-16 13:28:22 +01:00
Zbigniew Jędrzejewski-Szmek 40fd52f28d util-lib: rename path_check_fstype to path_is_fs_type 2017-11-30 20:43:25 +01:00
Daniel Lockyer 87e4e28dcf Replace empty ternary with helper method 2017-11-24 09:31:08 +00:00
Lennart Poettering 6925a0de4e cgroup-util: move Set* allocation into cg_kernel_controllers()
Previously, callers had to do this on their own. Let's make the call do
that instead, making the caller code a bit shorter.
2017-11-21 11:54:08 +01:00
Lennart Poettering bf516294c8 nspawn: minor optimization
no need to prepare the target path if we quite the loop anyway one step
later.
2017-11-21 11:54:08 +01:00
Lennart Poettering d7c9693a3e nspawn-mount: rework get_controllers() a bit
Let's rename get_controllers() → get_process_controllers(), in order to
underline the difference to cg_kernel_controllers(). After all, one
returns the controllers available to the process, the other the
controllers enabled in the kernel at all).

Let's also update the code to use read_line() and set_put_strdup() to
shorten the code a bit, and make it more robust.
2017-11-21 11:54:08 +01:00
Lennart Poettering ea9053c5f8 nspawn: rework mount_systemd_cgroup_writable() a bit
We shouldn't call alloca() as part of function calls, that's not really
defined in C. Hence, let's first do our stack allocations, and then
invoke functions.

Also, some coding style fixes, and minor shuffling around.

No functional changes.
2017-11-21 11:54:08 +01:00
Zbigniew Jędrzejewski-Szmek 53e1b68390 Add SPDX license identifiers to source files under the LGPL
This follows what the kernel is doing, c.f.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5fd54ace4721fc5ce2bb5aef6318fcf17f421460.
2017-11-19 19:08:15 +01:00
Lauri Tirkkonen 4f13e53428 nspawn: EROFS for chowning mount points is not fatal (#7122)
This fixes --read-only with --private-users. mkdir_userns_p may return
-EROFS if either mkdir or lchown fails; lchown failing is fine as the
mount point will just be overmounted, and if mkdir fails then the
following mount() will also fail (with ENOENT).
2017-10-24 19:40:50 +02:00
Zbigniew Jędrzejewski-Szmek 349cc4a507 build-sys: use #if Y instead of #ifdef Y everywhere
The advantage is that is the name is mispellt, cpp will warn us.

$ git grep -Ee "conf.set\('(HAVE|ENABLE)_" -l|xargs sed -r -i "s/conf.set\('(HAVE|ENABLE)_/conf.set10('\1_/"
$ git grep -Ee '#ifn?def (HAVE|ENABLE)' -l|xargs sed -r -i 's/#ifdef (HAVE|ENABLE)/#if \1/; s/#ifndef (HAVE|ENABLE)/#if ! \1/;'
$ git grep -Ee 'if.*defined\(HAVE' -l|xargs sed -i -r 's/defined\((HAVE_[A-Z0-9_]*)\)/\1/g'
$ git grep -Ee 'if.*defined\(ENABLE' -l|xargs sed -i -r 's/defined\((ENABLE_[A-Z0-9_]*)\)/\1/g'
+ manual changes to meson.build

squash! build-sys: use #if Y instead of #ifdef Y everywhere

v2:
- fix incorrect setting of HAVE_LIBIDN2
2017-10-04 12:09:29 +02:00
Zbigniew Jędrzejewski-Szmek b167945935 nspawn: do not mount /sys/fs/kdbus 2017-07-23 12:03:00 -04:00
tomty89 e8a94ce83e nspawn: add nosuid and nodev to /tmp mount (#6004)
When automatic /tmp mount was introduced to nspawn in v219, it was done without having the nosuid and nodev mount options, which was the same case as systemd's default tmp.mount unit back then.

nosuid and nodev was added to tmp.mount(.m4) in v231 for security reasons. matching the nspawn /tmp mount entry against that.

Ref.:
2f9df7c96a
bbb99c30d0
2017-05-23 09:41:36 +02:00
Zbigniew Jędrzejewski-Szmek 78e4f19ebc Merge pull request #5444 from poettering/cgroups-revert-no-error
Revert "core: simplify cg_[all_]unified()" and more.
2017-02-24 18:48:57 -05:00
AsciiWolf 13e785f7a0 Fix missing space in comments (#5439) 2017-02-24 18:14:02 +01:00
Lennart Poettering b4cccbc13a cgroup: change cg_unified() to possibly return errors again
We use our cgroup APIs in various contexts, including from our libraries
sd-login, sd-bus. As we don#t control those environments we can't rely
that the unified cgroup setup logic succeeds, and hence really shouldn't
assert on it.

This more or less reverts 415fc41cea.
2017-02-24 17:52:58 +01:00
Tejun Heo 2977724b09 core: make hybrid cgroup unified mode keep compat /sys/fs/cgroup/systemd hierarchy
Currently the hybrid mode mounts cgroup v2 on /sys/fs/cgroup instead of the v1
name=systemd hierarchy.  While this works fine for systemd itself, it breaks
tools which expect cgroup v1 hierarchy on /sys/fs/cgroup/systemd.

This patch updates the hybrid mode so that it mounts v2 hierarchy on
/sys/fs/cgroup/unified and keeps v1 "name=systemd" hierarchy on
/sys/fs/cgroup/systemd for compatibility.  systemd itself doesn't depend on the
"name=systemd" hierarchy at all.  All operations take place on the v2 hierarchy
as before but the v1 hierarchy is kept in sync so that any tools which expect
it to be there can keep doing so.  This allows systemd to take advantage of
cgroup v2 process management without requiring other tools to be aware of the
hybrid mode.

The hybrid mode is implemented by mapping the special systemd controller to
/sys/fs/cgroup/unified and making the basic cgroup utility operations -
cg_attach(), cg_create(), cg_rmdir() and cg_trim() - also operate on the
/sys/fs/cgroup/systemd hierarchy whenever the cgroup2 hierarchy is updated.

While a bit messy, this will allow dropping complications from using cgroup v1
for process management a lot sooner than otherwise possible which should make
it a net gain in terms of maintainability.

v2: Fixed !cgns breakage reported by @evverx and renamed the unified mount
    point to /sys/fs/cgroup/unified as suggested by @brauner.

v3: chown the compat hierarchy too on delegation.  Suggested by @evverx.

v4: [zj]
- drop the change to default, full "legacy" is still the default.
2017-02-20 12:28:35 -05:00
Tejun Heo 415fc41cea core: simplify cg_[all_]unified()
cg_[all_]unified() test whether a specific controller or all controllers are on
the unified hierarchy.  While what's being asked is a simple binary question,
the callers must assume that the functions may fail any time, which
unnecessarily complicates their usages.  This complication is unnecessary.
Internally, the test result is cached anyway and there are only a few places
where the test actually needs to be performed.

This patch simplifies cg_[all_]unified().

* cg_[all_]unified() are updated to return bool.  If the result can't be
  decided, assertion failure is triggered.  Error handlings from their callers
  are dropped.

* cg_unified_flush() is updated to calculate the new result synchrnously and
  return whether it succeeded or not.  Places which need to flush the test
  result are updated to test for failure.  This ensures that all the following
  cg_[all_]unified() tests succeed.

* Places which expected possible cg_[all_]unified() failures are updated to
  call and test cg_unified_flush() before calling cg_[all_]unified().  This
  includes functions used while setting up mounts during boot and
  manager_setup_cgroup().
2017-02-18 17:51:13 -05:00
Philip Withnall b53ede699c nspawn: Add support for sysroot pivoting (#5258)
Add a new --pivot-root argument to systemd-nspawn, which specifies a
directory to pivot to / inside the container; while the original / is
pivoted to another specified directory (if provided). This adds
support for booting container images which may contain several bootable
sysroots, as is common with OSTree disk images. When these disk images
are booted on real hardware, ostree-prepare-root is run in conjunction
with sysroot.mount in the initramfs to achieve the same results.
2017-02-08 16:54:31 +01:00
Lennart Poettering a4c35b6b4d nspawn: split out VolatileMode definitions
This moves the VolatileMode enum and its helper functions to src/shared/. This
is useful to then reuse them to implement systemd.volatile= in a later commit.
2016-12-20 20:00:08 +01:00
Evgeny Vereshchagin c9fd987279 nspawn: don't hide --bind=/tmp/* mounts (#4824)
Fixes #4789
2016-12-05 18:14:05 +01:00
Lennart Poettering cb638b5e96 util-lib: rename CHASE_NON_EXISTING → CHASE_NONEXISTENT
As suggested by @keszybz
2016-12-01 12:49:55 +01:00
Lennart Poettering ec57bd426a nspawn: improve log messages
When complaining about the inability to resolve a path, show the full path, not
just the relative one.

As suggested by @keszybz.
2016-12-01 12:41:18 +01:00
Lennart Poettering c7a4890ce4 nspawn: optionally, automatically allocated --bind=/--overlay source from /var/tmp
This extends the --bind= and --overlay= syntax so that an empty string as source/upper
directory is taken as request to automatically allocate a temporary directory
below /var/tmp, whose lifetime is bound to the nspawn runtime. In combination
with the "+" path extension this permits a switch "--overlay=+/var::/var" in
order to use the container's shipped /var, combine it with a writable temporary
directory and mount it to the runtime /var of the container.
2016-12-01 12:41:18 +01:00
Lennart Poettering 86c0dd4a71 nspawn: permit prefixing of source paths in --bind= and --overlay= with "+"
If a source path is prefixed with "+" it is taken relative to the container's
root directory instead of the host. This permits easily establishing bind and
overlay mounts based on data from the container rather than the host.

This also reworks custom_mounts_prepare(), and turns it into two functions: one
custom_mount_check_all() that remains in nspawn.c but purely verifies the
validity of the custom mounts configured. And one called
custom_mount_prepare_all() that actually does the preparation step, sorts the
custom mounts, resolves relative paths, and allocates temporary directories as
necessary.
2016-12-01 12:41:18 +01:00
Lennart Poettering ad85779a50 nspawn: split out overlayfs argument parsing into a function of its own
Add overlay_mount_parse() similar in style to tmpfs_mount_parse() and
bind_mount_parse().
2016-12-01 00:25:51 +01:00
Lennart Poettering 48cbe5f80b nspawn: use -ENOMEM instead of log_oom() in one case
The function is of the "library" kind and doesn't log ENOMEM in all other
cases, hence fix the one outlier.
2016-12-01 00:25:51 +01:00
Lennart Poettering 8ce48cf0f8 nspawn: use the new CHASE_NON_EXISTING flag when resolving mount points
This restores the ability to implicitly create files/directories to mount
specified mount points on.
2016-12-01 00:25:51 +01:00
Lennart Poettering c4f4fce79e fs-util: add flags parameter to chase_symlinks()
Let's remove chase_symlinks_prefix() and instead introduce a flags parameter to
chase_symlinks(), with a flag CHASE_PREFIX_ROOT that exposes the behaviour of
chase_symlinks_prefix().
2016-12-01 00:25:51 +01:00
Lennart Poettering 68cf43c315 nspawn: use chase_symlinks() on all paths specified via --tmpfs=, --bind= and so on
Fixes: #2860
2016-12-01 00:25:51 +01:00
Lennart Poettering 4da92e5857 nspawn: coding style: don't mix variable declarations and function calls 2016-12-01 00:25:51 +01:00
Lennart Poettering 5639193139 nspawn: use realloc_multiply() where it makes sense 2016-12-01 00:25:51 +01:00
Lennart Poettering e187369587 tree-wide: stop using canonicalize_file_name(), use chase_symlinks() instead
Let's use chase_symlinks() everywhere, and stop using GNU
canonicalize_file_name() everywhere. For most cases this should not change
behaviour, however increase exposure of our function to get better tested. Most
importantly in a few cases (most notably nspawn) it can take the correct root
directory into account when chasing symlinks.
2016-12-01 00:25:51 +01:00
Lennart Poettering acbbf69b71 nspawn: don't require chown() if userns is not on
Fixes: #4711
2016-11-22 13:35:24 +01:00
Sergiusz Urbaniak 4f086aab52
nspawn: R/W support for /sys, and /proc/sys
This commit adds the possibility to leave /sys, and /proc/sys read-write.
It introduces a new (undocumented) env var SYSTEMD_NSPAWN_API_VFS_WRITABLE
to enable this feature.

If set to "yes", /sys, and /proc/sys will be read-write.
If set to "no", /sys, and /proc/sys will be read-only.
If set to "network" /proc/sys/net will be read-write. This is useful in
use-cases, where systemd-nspawn is used in an external network
namespace.

This adds the possibility to start privileged containers which need more
control over settings in the /proc, and /sys filesystem.

This is also a follow-up on the discussion from
https://github.com/systemd/systemd/pull/4018#r76971862 where an
introduction of a simple env var to enable R/W support for those
directories was already discussed.
2016-11-18 09:50:40 +01:00
tblume bdb4e0cb64 systemd-nspawn: decrease non-fatal mount errors to debug level (#4569)
non-fatal mount errors shouldn't be logged as warnings.
2016-11-07 08:20:43 -05:00
Lennart Poettering 493fd52f1a Merge pull request #4510 from keszybz/tree-wide-cleanups
Tree wide cleanups
2016-11-03 13:59:20 -06:00
Evgeny Vereshchagin 63eae72312 nspawn: really lchown(uid/gid)
https://github.com/systemd/systemd/pull/4372#issuecomment-253723849:

* `mount_all (outer_child)` creates `container_dir/sys/fs/selinux`
* `mount_all (outer_child)` doesn't patch `container_dir/sys/fs` and so on.
* `mount_sysfs (inner_child)` tries to create `/sys/fs/cgroup`
* This fails

370   stat("/sys/fs", {st_dev=makedev(0, 28), st_ino=13880, st_mode=S_IFDIR|0755, st_nlink=3, st_uid=65534, st_gid=65534, st_blksize=4096, st_blocks=0, st_size=60, st_atime=2016/10/14-05:16:43.398665943, st_mtime=2016/10/14-05:16:43.399665943, st_ctime=2016/10/14-05:16:43.399665943}) = 0
370   mkdir("/sys/fs/cgroup", 0755)     = -1 EACCES (Permission denied)

* `mount_syfs (inner_child)` ignores that error and

mount(NULL, "/sys", NULL, MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOEXEC|MS_REMOUNT|MS_BIND, NULL) = 0

* `mount_cgroups` finally fails
2016-10-23 23:23:40 -04:00
Zbigniew Jędrzejewski-Szmek 9aa2169eae nspawn: use the return value from asprintf instead of checking the pointer
If allocation fails, the value of the point is "undefined". In practice
this matters very little, but for consistency with rest of the code,
let's check the return value.
2016-10-23 11:57:55 -04:00
Zbigniew Jędrzejewski-Szmek 605405c6cc tree-wide: drop NULL sentinel from strjoin
This makes strjoin and strjoina more similar and avoids the useless final
argument.

spatch -I . -I ./src -I ./src/basic -I ./src/basic -I ./src/shared -I ./src/shared -I ./src/network -I ./src/locale -I ./src/login -I ./src/journal -I ./src/journal -I ./src/timedate -I ./src/timesync -I ./src/nspawn -I ./src/resolve -I ./src/resolve -I ./src/systemd -I ./src/core -I ./src/core -I ./src/libudev -I ./src/udev -I ./src/udev/net -I ./src/udev -I ./src/libsystemd/sd-bus -I ./src/libsystemd/sd-event -I ./src/libsystemd/sd-login -I ./src/libsystemd/sd-netlink -I ./src/libsystemd/sd-network -I ./src/libsystemd/sd-hwdb -I ./src/libsystemd/sd-device -I ./src/libsystemd/sd-id128 -I ./src/libsystemd-network --sp-file coccinelle/strjoin.cocci --in-place $(git ls-files src/*.c)

git grep -e '\bstrjoin\b.*NULL' -l|xargs sed -i -r 's/strjoin\((.*), NULL\)/strjoin(\1)/'

This might have missed a few cases (spatch has a really hard time dealing
with _cleanup_ macros), but that's no big issue, they can always be fixed
later.
2016-10-23 11:43:27 -04:00
Lennart Poettering 18e51a022c Merge pull request #4351 from keszybz/nspawn-debugging
Enhance nspawn debug logs for mount/unmount operations
2016-10-12 11:21:11 +02:00
Evgeny Vereshchagin 8492849ee5 nspawn: let's mount(/tmp) inside the user namespace (#4340)
Fixes:
host# systemd-nspawn -D ... -U -b systemd.unit=multi-user.target
...
$ grep /tmp /proc/self/mountinfo
154 145 0:41 / /tmp rw - tmpfs tmpfs rw,seclabel,uid=1036124160,gid=1036124160

$ umount /tmp
umount: /root/tmp: not mounted

$ systemctl poweroff
...
[FAILED] Failed unmounting Temporary Directory.
2016-10-11 17:18:27 -04:00
Zbigniew Jędrzejewski-Szmek 60e76d4897 nspawn,mount-util: add [u]mount_verbose and use it in nspawn
This makes it easier to debug failed nspawn invocations:

Mounting sysfs on /var/lib/machines/fedora-rawhide/sys (MS_RDONLY|MS_NOSUID|MS_NOEXEC|MS_NODEV "")...
Mounting tmpfs on /var/lib/machines/fedora-rawhide/dev (MS_NOSUID|MS_STRICTATIME "mode=755,uid=1450901504,gid=1450901504")...
Mounting tmpfs on /var/lib/machines/fedora-rawhide/dev/shm (MS_NOSUID|MS_NODEV|MS_STRICTATIME "mode=1777,uid=1450901504,gid=1450901504")...
Mounting tmpfs on /var/lib/machines/fedora-rawhide/run (MS_NOSUID|MS_NODEV|MS_STRICTATIME "mode=755,uid=1450901504,gid=1450901504")...
Bind-mounting /sys/fs/selinux on /var/lib/machines/fedora-rawhide/sys/fs/selinux (MS_BIND "")...
Remounting /var/lib/machines/fedora-rawhide/sys/fs/selinux (MS_RDONLY|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_BIND|MS_REMOUNT "")...
Mounting proc on /proc (MS_NOSUID|MS_NOEXEC|MS_NODEV "")...
Bind-mounting /proc/sys on /proc/sys (MS_BIND "")...
Remounting /proc/sys (MS_RDONLY|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_BIND|MS_REMOUNT "")...
Bind-mounting /proc/sysrq-trigger on /proc/sysrq-trigger (MS_BIND "")...
Remounting /proc/sysrq-trigger (MS_RDONLY|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_BIND|MS_REMOUNT "")...
Mounting tmpfs on /tmp (MS_STRICTATIME "mode=1777,uid=0,gid=0")...
Mounting tmpfs on /sys/fs/cgroup (MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_STRICTATIME "mode=755,uid=0,gid=0")...
Mounting cgroup on /sys/fs/cgroup/systemd (MS_NOSUID|MS_NOEXEC|MS_NODEV "none,name=systemd,xattr")...
Failed to mount cgroup on /sys/fs/cgroup/systemd (MS_NOSUID|MS_NOEXEC|MS_NODEV "none,name=systemd,xattr"): No such file or directory
2016-10-11 16:50:07 -04:00
Zbigniew Jędrzejewski-Szmek add554f4e1 nspawn: small cleanups in get_controllers()
- check for oom after strdup
- no need to truncate the line since we're only extracting one field anyway
- use STR_IN_SET
2016-10-11 16:46:58 -04:00
Zbigniew Jędrzejewski-Szmek ada5412039 nspawn: simplify arg_us_cgns passing
We would check the condition cg_ns_supported() twice. No functional
change.
2016-10-11 16:46:58 -04:00
Lennart Poettering 920a7899de nspawn: let's mount /proc/sysrq-trigger read-only by default
LXC does this, and we should probably too. Better safe than sorry.
2016-09-25 10:42:18 +02:00
Lennart Poettering 6b7c9f8bce namespace: rework how ReadWritePaths= is applied
Previously, if ReadWritePaths= was nested inside a ReadOnlyPaths=
specification, then we'd first recursively apply the ReadOnlyPaths= paths, and
make everything below read-only, only in order to then flip the read-only bit
again for the subdirs listed in ReadWritePaths= below it.

This is not only ugly (as for the dirs in question we first turn on the RO bit,
only to turn it off again immediately after), but also problematic in
containers, where a container manager might have marked a set of dirs read-only
and this code will undo this is ReadWritePaths= is set for any.

With this patch behaviour in this regard is altered: ReadOnlyPaths= will not be
applied to the children listed in ReadWritePaths= in the first place, so that
we do not need to turn off the RO bit for those after all.

This means that ReadWritePaths=/ReadOnlyPaths= may only be used to turn on the
RO bit, but never to turn it off again. Or to say this differently: if some
dirs are marked read-only via some external tool, then ReadWritePaths= will not
undo it.

This is not only the safer option, but also more in-line with what the man page
currently claims:

        "Entries (files or directories) listed in ReadWritePaths= are
        accessible from within the namespace with the same access rights as
        from outside."

To implement this change bind_remount_recursive() gained a new "blacklist"
string list parameter, which when passed may contain subdirs that shall be
excluded from the read-only mounting.

A number of functions are updated to add more debug logging to make this more
digestable.
2016-09-25 10:40:51 +02:00