Commit Graph

603 Commits

Author SHA1 Message Date
Lennart Poettering ae2a15bc14 macro: introduce TAKE_PTR() macro
This macro will read a pointer of any type, return it, and set the
pointer to NULL. This is useful as an explicit concept of passing
ownership of a memory area between pointers.

This takes inspiration from Rust:

https://doc.rust-lang.org/std/option/enum.Option.html#method.take

and was suggested by Alan Jenkins (@sourcejedi).

It drops ~160 lines of code from our codebase, which makes me like it.
Also, I think it clarifies passing of ownership, and thus helps
readability a bit (at least for the initiated who know the new macro)
2018-03-22 20:21:42 +01:00
Lennart Poettering 4526113f57 dissect: add dissect_image_and_warn() that unifies error message generation for dissect_image() (#8517) 2018-03-21 12:10:01 +01:00
Zbigniew Jędrzejewski-Szmek 0441378080 nspawn: move network namespace creation to a separate step (#8430)
Fixes #8427.

Unsharing the namespace in a separate step changes the ownership of
/proc/net/ip_tables_names (and related files) from nobody:nobody to
root:root. See [1] and [2] for all the details.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f13f2aeed154da8e48f90b85e720f8ba39b1e881
[2] https://bugzilla.netfilter.org/show_bug.cgi?id=1064#c9
2018-03-20 18:07:17 +01:00
Lennart Poettering 2b33ab0957 tree-wide: port various places over to use new rearrange_stdio() 2018-03-02 11:42:10 +01:00
Zbigniew Jędrzejewski-Szmek 8405dcf752 nspawn: make sure we don't leak the fd in chase_symlinks_and_update
No callers use CHASE_OPEN right now, but let's be defensive.
2018-02-15 10:18:25 +01:00
Lennart Poettering d72495759b tree-wide: port all code to use safe_getcwd() 2018-01-17 11:17:38 +01:00
Lennart Poettering 75152a4d6a tree-wide: install matches asynchronously
Let's remove a number of synchronization points from our service
startups: let's drop synchronous match installation, and let's opt for
asynchronous instead.

Also, let's use sd_bus_match_signal() instead of sd_bus_add_match()
where we can.
2018-01-05 13:58:32 +01:00
Lennart Poettering d2e0ac3d1e tree-wide: unify the process name we pass to wait_for_terminate_and_check() with the one we pass to safe_fork() 2018-01-04 13:27:27 +01:00
Lennart Poettering 7d4904fe7a process-util: rework wait_for_terminate_and_warn() to take a flags parameter
This renames wait_for_terminate_and_warn() to
wait_for_terminate_and_check(), and adds a flags parameter, that
controls how much to log: there's one flag that means we log about
abnormal stuff, and another one that controls whether we log about
non-zero exit codes. Finally, there's a shortcut flag value for logging
in both cases, as that's what we usually use.

All callers are accordingly updated. At three occasions duplicate logging
is removed, i.e. where the old function was called but logged in the
caller, too.
2018-01-04 13:27:27 +01:00
Zbigniew Jędrzejewski-Szmek dae8b82eb9 Add mkdir_errno_wrapper() and use instead of mkdir() in various places
We'd pass pointers to mkdir and mkdir_label to call in various places. mkdir
returns the error in errno while mkdir_label returns the error directly.
2017-12-16 13:28:22 +01:00
Zbigniew Jędrzejewski-Szmek bdd2bbc445
Merge pull request #7469 from kinvolk/dongsu/nspawn-netns
nspawn: introduce an option for specifying network namespace path
2017-12-14 22:47:57 +01:00
Lennart Poettering fbd0b64f44
tree-wide: make use of new STRLEN() macro everywhere (#7639)
Let's employ coccinelle to do this for us.

Follow-up for #7625.
2017-12-14 19:02:29 +01:00
Dongsu Park d7bea6b629 nspawn: introduce an option for specifying network namespace path
Add a new option `--network-namespace-path` to systemd-nspawn to allow
users to specify an arbitrary network namespace, e.g. `/run/netns/foo`.
Then systemd-nspawn will open the netns file, pass the fd to
outer_child, and enter the namespace represented by the fd before
running inner_child.

```
$ sudo ip netns add foo
$ mount | grep /run/netns/foo
nsfs on /run/netns/foo type nsfs (rw)
...
$ sudo systemd-nspawn -D /srv/fc27 --network-namespace-path=/run/netns/foo \
  /bin/readlink -f /proc/self/ns/net
/proc/1/ns/net:[4026532009]
```

Note that the option `--network-namespace-path=` cannot be used together
with other network-related options such as `--private-network` so that
the options do not conflict with each other.

Fixes https://github.com/systemd/systemd/issues/7361
2017-12-13 10:21:06 +00:00
Lennart Poettering fba868fa71 tree-wide: unify logging of "Must be root" message
Let's unify this in one call, generalizing must_be_root() from
bootctl.c.
2017-12-11 23:19:45 +01:00
Lennart Poettering 8fd010bb1b nspawn: turn on watchdog logic for nspawn too
It's a long-running daemon, and it's easy to enable, hence do it.
2017-12-07 12:34:46 +01:00
Lennart Poettering 87d5e4f286 build-sys: make the dynamic UID range, and the container UID range configurable
Also, export these ranges in our pkg-config files.
2017-12-06 12:55:37 +01:00
Lennart Poettering de54e02d5e nspawn: when in hybrid mode, chown() both the legacy and the unified hierarchy to the root in the container
If user namespacing is used, let's make sure that the root user in the
container gets access to both /sys/fs/cgroup/systemd and
/sys/fs/cgroup/unified.

This matches similar logic in cg_set_access().
2017-12-05 13:49:13 +01:00
Lennart Poettering 2d3a5a73e0 nspawn: make sure images containing an ESP are compatible with userns -U mode
In -U mode we might need to re-chown() all files and directories to
match the UID shift we want for the image. That's problematic on fat
partitions, such as the ESP (and which is generated by mkosi's
--bootable switch), because fat of course knows no UID/GID file
ownership natively.

With this change we take benefit of the uid= and gid= mount options FAT
knows: instead of chown()ing all files and directories we can just
specify the right UID/GID to use at mount time.

This beefs up the image dissection logic in two ways:

1. First of all support for mounting relevant file systems with
   uid=/gid= is added: when a UID is specified during mount it is used for
   all applicable file systems.

2. Secondly, two new mount flags are added:
   DISSECT_IMAGE_MOUNT_ROOT_ONLY and DISSECT_IMAGE_MOUNT_NON_ROOT_ONLY.
   If one is specified the mount routine will either only mount the root
   partition of an image, or all partitions except the root partition.
   This is used by nspawn: first the root partition is mounted, so that
   we can determine the UID shift in use so far, based on ownership of
   the image's root directory. Then, we mount the remaining partitions
   in a second go, this time with the right UID/GID information.
2017-12-05 13:49:12 +01:00
Lennart Poettering 8199d554c1 nspawn: figure out cgroup mode *after* mounting image
If we operate on a disk image (i.e. --image=) then it's pointless to
look into the mount directory before it is actually mounted to see which
systemd version is running inside...

Unfortunately we only mount the disk image in the child process, but the
parent needs to know the cgroup mode, hence add some IPC for this
purpose and communicate the cgroup mode determined from the image back
to the parent.
2017-12-05 13:49:12 +01:00
Yu Watanabe 62b1e758d3 nspawn: adjust path to static resolv.conf to support split usr
Fixes #7302.
2017-11-25 21:11:07 +09:00
Lennart Poettering d381c8a6bf nspawn: hash the machine name, when looking for a suitable UID base (#7437)
When "-U" is used we look for a UID range we can use for our container.
We start with the UID the tree is already assigned to, and if that
didn't work we'd pick random ranges so far. With this change we'll first
try to hash a suitable range from the container name, and use that if it
works, in order to make UID assignments more likely to be stable.

This follows a similar logic PID 1 follows when using DynamicUser=1.
2017-11-24 20:57:19 +01:00
Lennart Poettering abdb9b08f6 nspawn: make use of the RequestStop logic of scope units
Since time began, scope units had a concept of "Controllers", a bus peer
that would be notified when somebody requested a unit to stop. None of
our code used that facility so far, let's change that.

This way, nspawn can print a nice message when somebody invokes
"systemctl stop" on the container's scope unit, and then react with the
right action to shut it down.
2017-11-23 21:47:48 +01:00
Shawn Landden 4831981d89 tree-wide: adjust fall through comments so that gcc is happy
Distcc removes comments, making the comment silencing
not work.

I know there was a decision against a macro in commit
ec251fe7d5
2017-11-20 13:06:25 -08:00
Zbigniew Jędrzejewski-Szmek 53e1b68390 Add SPDX license identifiers to source files under the LGPL
This follows what the kernel is doing, c.f.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5fd54ace4721fc5ce2bb5aef6318fcf17f421460.
2017-11-19 19:08:15 +01:00
Lennart Poettering 3603efdea5 nspawn: make recursive chown()ing logic safe for being aborted in the middle
We currently use the ownership of the top-level directory as a hint
whether we need to descent into the whole tree to chown() it recursively
or not. This is problematic with the previous chown()ing algorithm, as
when descending into the tree we'd first chown() and then descend
further down, which meant that the top-level directory would be chowned
first, and an aborted recursive chowning would appear on the next
invocation as successful, even though it was not. Let's reshuffle things
a bit, to make the re-chown()ing safe regarding interruptions:

a) We chown() the dir we are looking at last, and descent into all its
   children first. That way we know that if the top-level dir is
   properly owned everything inside of it is properly owned too.

b) Before starting a chown()ing operation, we mark the top-level
   directory as owned by a special "busy" UID range, which we can use to
   recognize whether a tree was fully chowned: if it is marked as busy,
   it's definitely not fully chowned, as the busy ownership will only be
   fixed as final step of the chowning.

Fixes: #6292
2017-11-17 11:12:33 +01:00
Lennart Poettering 0986658d51
Merge pull request #6866 from sourcejedi/set-linger2
logind: fix `loginctl enable-linger`
2017-11-15 11:15:15 +01:00
Lennart Poettering 759aaedc5c dissect: when we invoke dissection on a loop device with partscan help the user
This adds some simply detection logic for cases where dissection is
invoked on an externally created loop device, and partitions have been
detected on it, but partition scanning so far was off. If this is
detected we now print a brief message indicating what the issue is,
instead of failing with a useless EINVAL message the kernel passed to
us.
2017-10-26 17:54:56 +02:00
Lennart Poettering eb38edce88 machine-image: add partial discovery of block devices as images
This adds some basic discovery of block device images for nspawn and
friends. Note that this doesn't add searching for block devices using
udev, but instead expects users to symlink relevant block devices into
/var/lib/machines. Discovery is hence done exactly like for
dir/subvol/raw file images, except that what is found may be a (symlink
to) a block device.

For now, we do not support cloning these images, but removal, renaming
and read-only flags are supported to the point where that makes sense.

Fixe: #6990
2017-10-26 17:54:56 +02:00
Alan Jenkins 8d9c2bca41 nspawn: comment to acknowledge lying about "user session" 2017-10-18 09:47:10 +01:00
Zbigniew Jędrzejewski-Szmek 349cc4a507 build-sys: use #if Y instead of #ifdef Y everywhere
The advantage is that is the name is mispellt, cpp will warn us.

$ git grep -Ee "conf.set\('(HAVE|ENABLE)_" -l|xargs sed -r -i "s/conf.set\('(HAVE|ENABLE)_/conf.set10('\1_/"
$ git grep -Ee '#ifn?def (HAVE|ENABLE)' -l|xargs sed -r -i 's/#ifdef (HAVE|ENABLE)/#if \1/; s/#ifndef (HAVE|ENABLE)/#if ! \1/;'
$ git grep -Ee 'if.*defined\(HAVE' -l|xargs sed -i -r 's/defined\((HAVE_[A-Z0-9_]*)\)/\1/g'
$ git grep -Ee 'if.*defined\(ENABLE' -l|xargs sed -i -r 's/defined\((ENABLE_[A-Z0-9_]*)\)/\1/g'
+ manual changes to meson.build

squash! build-sys: use #if Y instead of #ifdef Y everywhere

v2:
- fix incorrect setting of HAVE_LIBIDN2
2017-10-04 12:09:29 +02:00
Andreas Rammhold 3742095b27
tree-wide: use IN_SET where possible
In addition to the changes from #6933 this handles cases that could be
matched with the included cocci file.
2017-10-02 13:09:54 +02:00
Lennart Poettering 8e5430c4bd nspawn: set up a new session keyring for the container process
keyring material should not leak into the container. So far we relied on
seccomp to deny access to the keyring, but given that we now made the
seccomp configurable, and access to keyctl() and friends may optionally
be permitted to containers now let's make sure we disconnect the callers
keyring from the keyring of PID 1 in the container.
2017-09-22 15:28:04 +02:00
Lennart Poettering 960e4569e1 nspawn: implement configurable syscall whitelisting/blacklisting
Now that we have ported nspawn's seccomp code to the generic code in
seccomp-util, let's extend it to support whitelisting and blacklisting
of specific additional syscalls.

This uses similar syntax as PID1's support for system call filtering,
but in contrast to that always implements a blacklist (and not a
whitelist), as we prepopulate the filter with a blacklist, and the
unit's system call filter logic does not come with anything
prepopulated.

(Later on we might actually want to invert the logic here, and
whitelist rather than blacklist things, but at this point let's not do
that. In case we switch this over later, the syscall add/remove logic of
this commit should be compatible conceptually.)

Fixes: #5163

Replaces: #5944
2017-09-12 14:06:21 +02:00
Lennart Poettering 21022b9dde util-lib: wrap personality() to fix up broken glibc error handling (#6766)
glibc appears to propagate different errors in different ways, let's fix
this up, so that our own code doesn't get confused by this.

See #6752 + #6737 for details.

Fixes: #6755
2017-09-08 17:16:29 +03:00
Lennart Poettering 8cb5743079 nspawn: downgrade warning when we get sd_notify() message from unexpected process (#6416)
Given that we set NOTIFY_SOCKET unconditionally it's not surprising that
processes way down the process tree think it's smart to send us a
notification message.

It's still useful to keep this message, for debugging things, but it
shouldn't be generated by default.
2017-07-20 14:46:58 -04:00
Lennart Poettering cd2dfc6fae nspawn: register a scope for the unit if --register=no is specified (#6166)
Previously, only when --register=yes was set (the default) the invoked
container would get its own scope, created by machined on behalf of
nspawn. With this change if --register=no is set nspawn will still get
its own scope (which is a good thing, so that --slice= and --property=
take effect), but this is not done through machined but by registering a
scope unit directly in PID 1.

Summary:

--register=yes             → allocate a new scope through machined (the default)
--register=yes --keep-unit → use the unit we are already running in an register with machined
--register=no              → allocate a new scope directly, but no machined
--register=no --keep-unit  → do not allocate nor register anything

Fixes: #5823
2017-06-28 13:22:46 -04:00
Zbigniew Jędrzejewski-Szmek 35bca925f9 tree-wide: fix incorrect uses of %m
In those cases errno was not set, so we would be logging some unrelated error
or "Success".
2017-05-13 15:42:26 -04:00
Zbigniew Jędrzejewski-Szmek ab8ee0f259 tree-wide: use SET_FLAG in more places (#5892) 2017-05-07 07:03:28 -04:00
Zbigniew Jędrzejewski-Szmek 399e391fa6 nspawn: check cgroups after parsing options
Same justification as in previous commit.
2017-04-25 08:54:00 -04:00
Lennart Poettering 948a3241de Merge pull request #5708 from vcatechnology/arm-cross-compile
ARM32 cross-compile fixes
2017-04-17 15:49:06 +02:00
Matt Clarkson 6b5cf3ea62 build-sys: correct blkid.h includes
When using pkg-config to determine the include flags for blkid the
flags are returned as:

    $ pkg-config blkid --cflags
    -I/usr/include/blkid -I/usr/include/uuid

We use the <blkid/blkid.h> include which would be correct when using
the default compiler /usr/include header search path. However, when
cross-compiling the blkid.h will not be installed at /usr/include and
highly likely in a temporary system root. It is futher compounded if
the cross-compile packages are split up and the blkid package is not
available in the same sysroot as the compiler.

Regardless of the compilation setup, the correct include path should be
<blkid.h> if using the pkg-config returned CFLAGS.
2017-04-06 14:33:02 +01:00
David Michael 7357272ed1 nspawn: check if the DNS stub is listening for requests 2017-03-31 11:34:32 -07:00
Zbigniew Jędrzejewski-Szmek 78e4f19ebc Merge pull request #5444 from poettering/cgroups-revert-no-error
Revert "core: simplify cg_[all_]unified()" and more.
2017-02-24 18:48:57 -05:00
AsciiWolf 13e785f7a0 Fix missing space in comments (#5439) 2017-02-24 18:14:02 +01:00
Lennart Poettering c22800e40e cgroup: rename cg_unified() → cg_unified_controller()
cg_unified() is a bit generic a name, let's make clear that it checks
whether a specified controller is in unified mode.
2017-02-24 18:00:04 +01:00
Lennart Poettering b4cccbc13a cgroup: change cg_unified() to possibly return errors again
We use our cgroup APIs in various contexts, including from our libraries
sd-login, sd-bus. As we don#t control those environments we can't rely
that the unified cgroup setup logic succeeds, and hence really shouldn't
assert on it.

This more or less reverts 415fc41cea.
2017-02-24 17:52:58 +01:00
Tejun Heo 2977724b09 core: make hybrid cgroup unified mode keep compat /sys/fs/cgroup/systemd hierarchy
Currently the hybrid mode mounts cgroup v2 on /sys/fs/cgroup instead of the v1
name=systemd hierarchy.  While this works fine for systemd itself, it breaks
tools which expect cgroup v1 hierarchy on /sys/fs/cgroup/systemd.

This patch updates the hybrid mode so that it mounts v2 hierarchy on
/sys/fs/cgroup/unified and keeps v1 "name=systemd" hierarchy on
/sys/fs/cgroup/systemd for compatibility.  systemd itself doesn't depend on the
"name=systemd" hierarchy at all.  All operations take place on the v2 hierarchy
as before but the v1 hierarchy is kept in sync so that any tools which expect
it to be there can keep doing so.  This allows systemd to take advantage of
cgroup v2 process management without requiring other tools to be aware of the
hybrid mode.

The hybrid mode is implemented by mapping the special systemd controller to
/sys/fs/cgroup/unified and making the basic cgroup utility operations -
cg_attach(), cg_create(), cg_rmdir() and cg_trim() - also operate on the
/sys/fs/cgroup/systemd hierarchy whenever the cgroup2 hierarchy is updated.

While a bit messy, this will allow dropping complications from using cgroup v1
for process management a lot sooner than otherwise possible which should make
it a net gain in terms of maintainability.

v2: Fixed !cgns breakage reported by @evverx and renamed the unified mount
    point to /sys/fs/cgroup/unified as suggested by @brauner.

v3: chown the compat hierarchy too on delegation.  Suggested by @evverx.

v4: [zj]
- drop the change to default, full "legacy" is still the default.
2017-02-20 12:28:35 -05:00
Tejun Heo 415fc41cea core: simplify cg_[all_]unified()
cg_[all_]unified() test whether a specific controller or all controllers are on
the unified hierarchy.  While what's being asked is a simple binary question,
the callers must assume that the functions may fail any time, which
unnecessarily complicates their usages.  This complication is unnecessary.
Internally, the test result is cached anyway and there are only a few places
where the test actually needs to be performed.

This patch simplifies cg_[all_]unified().

* cg_[all_]unified() are updated to return bool.  If the result can't be
  decided, assertion failure is triggered.  Error handlings from their callers
  are dropped.

* cg_unified_flush() is updated to calculate the new result synchrnously and
  return whether it succeeded or not.  Places which need to flush the test
  result are updated to test for failure.  This ensures that all the following
  cg_[all_]unified() tests succeed.

* Places which expected possible cg_[all_]unified() failures are updated to
  call and test cg_unified_flush() before calling cg_[all_]unified().  This
  includes functions used while setting up mounts during boot and
  manager_setup_cgroup().
2017-02-18 17:51:13 -05:00
Tejun Heo bd15ab41a1 nspawn: fix cgroup mode detection
cgroup mode detection is broken in two different ways.

* detect_unified_cgroup_hierarchy() is called too nested in outer_child().
  sync_cgroup() which is used by run() also needs to know the requested cgroup
  mode but it's currently always getting CGROUP_UNIFIED_UNKNOWN.  This makes it
  skip syncing the inner cgroup hierarchy on some config combinations.

   $ cat /proc/self/cgroup | grep systemd
   1:name=systemd:/user.slice/user-0.slice/session-c1.scope

   $ UNIFIED_CGROUP_HIERARCHY=0 SYSTEMD_NSPAWN_USE_CGNS=0 systemd-nspawn -M container
   ...
   [root@container ~]# cat /proc/self/cgroup | grep systemd
   1:name=systemd:/machine.slice/machine-container.x86_64.scope
   $ exit

   $ UNIFIED_CGROUP_HIERARCHY=1 SYSTEMD_NSPAWN_USE_CGNS=0 systemd-nspawn -M container
   [root@container ~]# cat /proc/self/cgroup | grep 0::
   0::/
   $ exit

  Note how the unified hierarchy case's path is not synchronized with the host.
  This for example can cause issues when there are multiple such containers.

  Fixed by moving detect_unified_cgroup_hierarchy() invocation to main().

* inner_child() was invoking cg_unified_flush().  inner_child() executes fully
  scoped and can't determine which cgroup mode the host was in.  It doesn't
  make sense to keep flushing the detected mode when the host mode can't
  change.

  Fixed by replacing cg_unified_flush() invocations in outer_child() and
  inner_child() with one in main().
2017-02-18 17:49:06 -05:00
Zbigniew Jędrzejewski-Szmek 581a07f9f0 Merge pull request #5369 from poettering/nspawn-resolved
fixes for running nspawn+resolved in combination
2017-02-18 11:54:34 -05:00