Commit graph

115 commits

Author SHA1 Message Date
Lennart Poettering 30874dda3a dev-setup: generalize logic we use to create "inaccessible" device nodes
Let's generalize this, so that we can use this in nspawn later on, which
is pretty useful as we need to be able to mask files from the inner
child of nspawn too, where the host's /run/systemd/inaccessible
directory is not visible anymore. Moreover, if nspawn can create these
nodes on its own before the payload this means the payload can run with
fewer privileges.
2018-11-29 20:21:40 +01:00
Lennart Poettering 143fadf369 core: remove JoinControllers= configuration setting
This removes the ability to configure which cgroup controllers to mount
together. Instead, we'll now hardcode that "cpu" and "cpuacct" are
mounted together as well as "net_cls" and "net_prio".

The concept of mounting controllers together has no future as it does
not exist to cgroupsv2. Moreover, the current logic is systematically
broken, as revealed by the discussions in #10507. Also, we surveyed Red
Hat customers and couldn't find a single user of the concept (which
isn't particularly surprising, as it is broken...)

This reduced the (already way too complex) cgroup handling for us, since
we now know whenever we make a change to a cgroup for one controller to
which other controllers it applies.
2018-11-16 14:54:13 +01:00
Lennart Poettering 0c69794138 tree-wide: remove Lennart's copyright lines
These lines are generally out-of-date, incomplete and unnecessary. With
SPDX and git repository much more accurate and fine grained information
about licensing and authorship is available, hence let's drop the
per-file copyright notice. Of course, removing copyright lines of others
is problematic, hence this commit only removes my own lines and leaves
all others untouched. It might be nicer if sooner or later those could
go away too, making git the only and accurate source of authorship
information.
2018-06-14 10:20:20 +02:00
Lennart Poettering 818bf54632 tree-wide: drop 'This file is part of systemd' blurb
This part of the copyright blurb stems from the GPL use recommendations:

https://www.gnu.org/licenses/gpl-howto.en.html

The concept appears to originate in times where version control was per
file, instead of per tree, and was a way to glue the files together.
Ultimately, we nowadays don't live in that world anymore, and this
information is entirely useless anyway, as people are very welcome to
copy these files into any projects they like, and they shouldn't have to
change bits that are part of our copyright header for that.

hence, let's just get rid of this old cruft, and shorten our codebase a
bit.
2018-06-14 10:20:20 +02:00
Lennart Poettering ef31828d06 tree-wide: unify how we define bit mak enums
Let's always write "1 << 0", "1 << 1" and so on, except where we need
more than 31 flag bits, where we write "UINT64(1) << 0", and so on to force
64bit values.
2018-06-12 21:44:00 +02:00
Zbigniew Jędrzejewski-Szmek 6978efcffb core/mount-setup: remove part of check which is always true
f1470e424b removed one check, but missed a similar
one a few lines down.

CID #1390949.
2018-05-14 08:50:00 +02:00
Zbigniew Jędrzejewski-Szmek f1470e424b core/mount-setup: remove part of check which is always true
k was set to join_controllers at this point and only incremented, so
it cannot be null at this point.

CID #1390949.
2018-05-10 02:03:23 +02:00
Lennart Poettering fe80fcc7e8 mount-setup: add a comment that the character/block device nodes are "optional" (#8893)
if we lack privs to create device nodes that's fine, and creating
/run/systemd/inaccessible/chr or /run/systemd/inaccessible/blk won't
work then. Document this in longer comments.

Fixes: #4484
2018-05-03 23:10:35 +09:00
Zbigniew Jędrzejewski-Szmek 11a1589223 tree-wide: drop license boilerplate
Files which are installed as-is (any .service and other unit files, .conf
files, .policy files, etc), are left as is. My assumption is that SPDX
identifiers are not yet that well known, so it's better to retain the
extended header to avoid any doubt.

I also kept any copyright lines. We can probably remove them, but it'd nice to
obtain explicit acks from all involved authors before doing that.
2018-04-06 18:58:55 +02:00
Lennart Poettering 771b7ead84 machine-image,mount-setup: minor coding style fixes 2018-03-28 22:04:58 +02:00
Krzysztof Nowicki 6f7729c176 core: dont't remount /sys/fs/cgroup for relabel if not needed (#8595)
The initial fix for relabelling the cgroup filesystem for
SELinux delivered in commit 8739f23e3 was based on the assumption that
the cgroup filesystem is already populated once mount_setup() is
executed, which was true for my system. What I wasn't aware is that this
is the case only when another instance of systemd was running before
this one, which can happen if systemd is used in the initrd (for ex. by
dracut).

In case of a clean systemd start-up the cgroup filesystem is actually
being populated after mount_setup() and does not need relabelling as at
that moment the SELinux policy is already loaded. Since however the root
cgroup filesystem was remounted read-only in the meantime this operation
will now fail.

To fix this check for the filesystem mount flags before relabelling and
only remount ro->rw->ro if necessary and leave the filesystem read-write
otherwise.

Fixes #7901.
2018-03-28 13:36:33 +02:00
Lennart Poettering 08c849815c label: rework label_fix() implementations (#8583)
This reworks the SELinux and SMACK label fixing calls in a number of
ways:

1. The two separate boolean arguments of these functions are converted
   into a flags type LabelFixFlags.

2. The operations are now implemented based on O_PATH. This should
   resolve TTOCTTOU races between determining the label for the file
   system object and applying it, as it it allows to pin the object
   while we are operating on it.

3. When changing a label fails we'll query the label previously set, and
   if matches what we want to set anyway we'll suppress the error.

Also, all calls to label_fix() are now (void)ified, when we ignore the
return values.

Fixes: #8566
2018-03-27 07:38:26 +02:00
Lennart Poettering ae2a15bc14 macro: introduce TAKE_PTR() macro
This macro will read a pointer of any type, return it, and set the
pointer to NULL. This is useful as an explicit concept of passing
ownership of a memory area between pointers.

This takes inspiration from Rust:

https://doc.rust-lang.org/std/option/enum.Option.html#method.take

and was suggested by Alan Jenkins (@sourcejedi).

It drops ~160 lines of code from our codebase, which makes me like it.
Also, I think it clarifies passing of ownership, and thus helps
readability a bit (at least for the initiated who know the new macro)
2018-03-22 20:21:42 +01:00
Yu Watanabe 5cbaad2f67 core: do not free heap-allocated strings (#8391)
Fixes #8387.
2018-03-08 14:21:54 +01:00
Lennart Poettering 39f305a901 mount-setup: change bpf mount mode to 0700 (#8334)
After discussing with the kernel folks, we agreed to default to 0700 for
this. Better safe than sorry.
2018-03-02 12:55:24 +01:00
Lennart Poettering 6590080851 mount-setup: always use the same source as fstype for the API VFS we mount
So far, for all our API VFS mounts we used the fstype also as mount
source, let's do that for the cgroupsv2 mounts too. The kernel doesn't
really care about the source for API VFS, but it's visible to the user,
hence let's clean this up and follow the rule we otherwise follow.
2018-02-21 16:43:36 +01:00
Lennart Poettering 43b7f24b5e bpf: mount bpffs by default on boot
We make heavy use of BPF functionality these days, hence expose the BPF
file system too by default now. (Note however, that we don't actually
make use bpf file systems object yet, but we might later on too.)
2018-02-21 16:43:36 +01:00
Zbigniew Jędrzejewski-Szmek 56c8d7444a pid1: do not initialize join_controllers by default
We're moving towards unified cgroup hierarchy where this is not necessary.
This makes main.c a bit simpler.
2018-02-19 15:18:54 +01:00
Lennart Poettering 713a88757a mount-setup: fix MNT_CHECK_WRITABLE error handling, and log about the issue
Let's correct the error handling (the error is in errno, not r), and
let's add logging like the rest of the function has it.
2017-12-15 20:52:28 +01:00
Krzysztof Nowicki 8739f23e3c Fix SELinux labels in cgroup filesystem root directory (#7496)
When using SELinux with legacy cgroups the tmpfs on /sys/fs/cgroup is by
default labelled as tmpfs_t. This label is also inherited by the "cpu"
and "cpuacct" symbolic links. Unfortunately the policy expects them to
be labelled as cgroup_t, which is used for all the actual cgroup
filesystems. Failure to do so results in a stream of denials.

This state cannot be fixed reliably when the cgroup filesystem structure
is set-up as the SELinux policy is not yet loaded at this
moment. It also cannot be fixed later as the root of the cgroup
filesystem is remounted read-only. In order to fix it the root of the
cgroup filesystem needs to be temporary remounted read-write, relabelled
and remounted back read-only.
2017-11-30 11:59:29 +01:00
Christian Brauner 1ff654e28b core: remove empty cgroups (#7457)
When we skip an unwritable cgroup also remove the empty mountpoint.
2017-11-24 21:05:16 +01:00
Christian Brauner 2d56b80a18
cgroup: test whether pure unified hierarchy is writable
If it is not writable we should not mount it.
2017-11-22 17:35:21 +01:00
Christian Brauner e07aefbd67
cgroup: check whether unified hierarchy is writable
When systemd is running inside a container employing user
namespaces it currently mounts the unified cgroup hierarchy
without being able to write to it. This causes systemd to
freeze during boot.
This patch checks whether the unified cgroup hierarchy
is writable. If it is not it will not mount it.

This solution is based on a patch by Evgeny Vereshchagin.

Closes #6408.
Closes https://github.com/lxc/lxc/issues/1678 .
2017-11-22 17:34:25 +01:00
Lennart Poettering 6925a0de4e cgroup-util: move Set* allocation into cg_kernel_controllers()
Previously, callers had to do this on their own. Let's make the call do
that instead, making the caller code a bit shorter.
2017-11-21 11:54:08 +01:00
Zbigniew Jędrzejewski-Szmek 53e1b68390 Add SPDX license identifiers to source files under the LGPL
This follows what the kernel is doing, c.f.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5fd54ace4721fc5ce2bb5aef6318fcf17f421460.
2017-11-19 19:08:15 +01:00
Zbigniew Jędrzejewski-Szmek f9fa32f09c build-sys: s/HAVE_SMACK/ENABLE_SMACK/
Same justification as for HAVE_UTMP.
2017-10-04 12:09:50 +02:00
Zbigniew Jędrzejewski-Szmek 349cc4a507 build-sys: use #if Y instead of #ifdef Y everywhere
The advantage is that is the name is mispellt, cpp will warn us.

$ git grep -Ee "conf.set\('(HAVE|ENABLE)_" -l|xargs sed -r -i "s/conf.set\('(HAVE|ENABLE)_/conf.set10('\1_/"
$ git grep -Ee '#ifn?def (HAVE|ENABLE)' -l|xargs sed -r -i 's/#ifdef (HAVE|ENABLE)/#if \1/; s/#ifndef (HAVE|ENABLE)/#if ! \1/;'
$ git grep -Ee 'if.*defined\(HAVE' -l|xargs sed -i -r 's/defined\((HAVE_[A-Z0-9_]*)\)/\1/g'
$ git grep -Ee 'if.*defined\(ENABLE' -l|xargs sed -i -r 's/defined\((ENABLE_[A-Z0-9_]*)\)/\1/g'
+ manual changes to meson.build

squash! build-sys: use #if Y instead of #ifdef Y everywhere

v2:
- fix incorrect setting of HAVE_LIBIDN2
2017-10-04 12:09:29 +02:00
vliaskov 6c24adfd46 Revert "mount-setup: mount xenfs filesystem (#6491)" (#6662)
This reverts commit b305bd3aab.
2017-08-28 18:46:01 +02:00
vliaskov b305bd3aab mount-setup: mount xenfs filesystem (#6491) 2017-07-31 15:59:02 +02:00
Tejun Heo 4095205ecc core: support "nsdelegate" cgroup v2 mount option (#6294)
cgroup namespace wasn't useful for delegation because it allowed resource
control interface files (e.g. memory.high) to be written from inside the
namespace - this allowed the namespace parent's resource distribution to be
disturbed by its namespace-scoped children.

A new mount option, "nsdelegate", was added to cgroup v2 to address this issue.
The flag is meangingful only when mounting cgroup v2 in the init namespace and
makes a cgroup namespace a delegation boundary.  The kernel feature is pending
for v4.13.

This should have been the default behavior on cgroup namespaces and this commit
makes systemd try "nsdelegate" first when trying to mount cgroup v2 and fall
back if the option is not supported.

Note that this has danger of breaking usages which depend on modifying the
parent's resource settings from the namespace root, which isn't a valid thing
to do, but such usages may still exist.
2017-07-14 19:27:13 +02:00
Zbigniew Jędrzejewski-Szmek 1b59cf04ae core/mount-setup: if unified hierarchy is not supported, fall back to legacy
We need this to gracefully support older or strangely configured kernels.

v2:
- do not install a callback handler, just embed the right conditions into
  cg_is_*_wanted()

v3:
- fix bug in cg_is_legacy_wanted()
2017-02-22 11:52:31 -05:00
Zbigniew Jędrzejewski-Szmek a4464b9522 Rename cg_is_unified_systemd_controller_wanted to cg_is_hybrid_wanted
Less typing and doesn't make the table so incredibly wide.
2017-02-22 11:52:31 -05:00
Tejun Heo 2977724b09 core: make hybrid cgroup unified mode keep compat /sys/fs/cgroup/systemd hierarchy
Currently the hybrid mode mounts cgroup v2 on /sys/fs/cgroup instead of the v1
name=systemd hierarchy.  While this works fine for systemd itself, it breaks
tools which expect cgroup v1 hierarchy on /sys/fs/cgroup/systemd.

This patch updates the hybrid mode so that it mounts v2 hierarchy on
/sys/fs/cgroup/unified and keeps v1 "name=systemd" hierarchy on
/sys/fs/cgroup/systemd for compatibility.  systemd itself doesn't depend on the
"name=systemd" hierarchy at all.  All operations take place on the v2 hierarchy
as before but the v1 hierarchy is kept in sync so that any tools which expect
it to be there can keep doing so.  This allows systemd to take advantage of
cgroup v2 process management without requiring other tools to be aware of the
hybrid mode.

The hybrid mode is implemented by mapping the special systemd controller to
/sys/fs/cgroup/unified and making the basic cgroup utility operations -
cg_attach(), cg_create(), cg_rmdir() and cg_trim() - also operate on the
/sys/fs/cgroup/systemd hierarchy whenever the cgroup2 hierarchy is updated.

While a bit messy, this will allow dropping complications from using cgroup v1
for process management a lot sooner than otherwise possible which should make
it a net gain in terms of maintainability.

v2: Fixed !cgns breakage reported by @evverx and renamed the unified mount
    point to /sys/fs/cgroup/unified as suggested by @brauner.

v3: chown the compat hierarchy too on delegation.  Suggested by @evverx.

v4: [zj]
- drop the change to default, full "legacy" is still the default.
2017-02-20 12:28:35 -05:00
Lennart Poettering dee22f3970 core: add comment why we don't bother with MS_SHARED remounting of / in containers 2016-12-20 20:00:08 +01:00
Lennart Poettering e187369587 tree-wide: stop using canonicalize_file_name(), use chase_symlinks() instead
Let's use chase_symlinks() everywhere, and stop using GNU
canonicalize_file_name() everywhere. For most cases this should not change
behaviour, however increase exposure of our function to get better tested. Most
importantly in a few cases (most notably nspawn) it can take the correct root
directory into account when chasing symlinks.
2016-12-01 00:25:51 +01:00
Tejun Heo 5da38d0768 core: use the unified hierarchy for the systemd cgroup controller hierarchy
Currently, systemd uses either the legacy hierarchies or the unified hierarchy.
When the legacy hierarchies are used, systemd uses a named legacy hierarchy
mounted on /sys/fs/cgroup/systemd without any kernel controllers for process
management.  Due to the shortcomings in the legacy hierarchy, this involves a
lot of workarounds and complexities.

Because the unified hierarchy can be mounted and used in parallel to legacy
hierarchies, there's no reason for systemd to use a legacy hierarchy for
management even if the kernel resource controllers need to be mounted on legacy
hierarchies.  It can simply mount the unified hierarchy under
/sys/fs/cgroup/systemd and use it without affecting other legacy hierarchies.
This disables a significant amount of fragile workaround logics and would allow
using features which depend on the unified hierarchy membership such bpf cgroup
v2 membership test.  In time, this would also allow deleting the said
complexities.

This patch updates systemd so that it prefers the unified hierarchy for the
systemd cgroup controller hierarchy when legacy hierarchies are used for kernel
resource controllers.

* cg_unified(@controller) is introduced which tests whether the specific
  controller in on unified hierarchy and used to choose the unified hierarchy
  code path for process and service management when available.  Kernel
  controller specific operations remain gated by cg_all_unified().

* "systemd.legacy_systemd_cgroup_controller" kernel argument can be used to
  force the use of legacy hierarchy for systemd cgroup controller.

* nspawn: By default nspawn uses the same hierarchies as the host.  If
  UNIFIED_CGROUP_HIERARCHY is set to 1, unified hierarchy is used for all.  If
  0, legacy for all.

* nspawn: arg_unified_cgroup_hierarchy is made an enum and now encodes one of
  three options - legacy, only systemd controller on unified, and unified.  The
  value is passed into mount setup functions and controls cgroup configuration.

* nspawn: Interpretation of SYSTEMD_CGROUP_CONTROLLER to the actual mount
  option is moved to mount_legacy_cgroup_hierarchy() so that it can take an
  appropriate action depending on the configuration of the host.

v2: - CGroupUnified enum replaces open coded integer values to indicate the
      cgroup operation mode.
    - Various style updates.

v3: Fixed a bug in detect_unified_cgroup_hierarchy() introduced during v2.

v4: Restored legacy container on unified host support and fixed another bug in
    detect_unified_cgroup_hierarchy().
2016-08-17 17:44:36 -04:00
Alessandro Puccetti c4b4170746 namespace: unify limit behavior on non-directory paths
Despite the name, `Read{Write,Only}Directories=` already allows for
regular file paths to be masked. This commit adds the same behavior
to `InaccessibleDirectories=` and makes it explicit in the doc.
This patch introduces `/run/systemd/inaccessible/{reg,dir,chr,blk,fifo,sock}`
{dile,device}nodes and mounts on the appropriate one the paths specified
in `InacessibleDirectories=`.

Based on Luca's patch from https://github.com/systemd/systemd/pull/3327
2016-07-19 17:22:02 +02:00
Dave Reisner 222953e87f Ensure kdbus isn't used (#3501)
Delete the dbus1 generator and some critical wiring. This prevents
kdbus from being loaded or detected. As such, it will never be used,
even if the user still has a useful kdbus module loaded on their system.

Sort of fixes #3480. Not really, but it's better than the current state.
2016-06-18 17:24:23 -04:00
Harald Hoyer cacf980ed4 core/mount-setup.c: also relabel /dev/shm for selinux (#3039)
daemons, which wish to transition state from the initramfs to the real
root, might use /dev/shm for their state.

As /dev is not relabeled across mount points, /dev/shm has to be
relabled explicitly.
2016-04-14 19:14:29 -04:00
Alban Crequy 099619957a cgroup2: use new fstype for unified hierarchy
Since Linux v4.4-rc1, __DEVEL__sane_behavior does not exist anymore and
is replaced by a new fstype "cgroup2".

With this patch, systemd no longer supports the old (unstable) way of
doing unified hierarchy with __DEVEL__sane_behavior and systemd now
requires Linux v4.4 for unified hierarchy.

Non-unified hierarchy is still the default and is unchanged by this
patch.

67e9c74b8a
2016-03-26 12:05:29 -04:00
Daniel Mack b26fa1a2fb tree-wide: remove Emacs lines from all files
This should be handled fine now by .dir-locals.el, so need to carry that
stuff in every file.
2016-02-10 13:41:57 +01:00
Lennart Poettering 1411b09467 core: log about path_is_mount_point() errors
We really shouldn't fail silently, but print a log message about these errors. Also make sure to attach error codes to
all log messages where that makes sense.

(While we are at it, add a couple of (void) casts to functions where we knowingly ignore return values.)
2016-02-03 23:58:53 +01:00
Alexander Kuleshov 400fac0609 mount-setup: introduce mount_points_setup
The mount_setup_early() and mount_setup() contain almost the same
pieces of code which calls mount_one() for a certain mount point
from the mount_table. This patch introduces mount_points_setup()
helper to prevent code duplication.
2016-02-03 01:03:12 +06:00
Patrick Ohly ea2b93a8ee mount-setup.c: fix handling of symlink Smack labelling in cgroup setup
The code introduced in f8c1a81c51 (= systemd 227) failed for me with:
  Failed to copy smack label from net_cls to /sys/fs/cgroup/net_cls: No such file or directory

There is no need for a symlink in this case because source and target
are identical. The symlink() call is allowed to fail when the target
already exists. When that happens, copying the Smack label must be
skipped.

But the code also failed when there is a symlink, like "cpu ->
cpu,cpuacct", because mac_smack_copy() got called with
src="cpu,cpuacct" which fails to find the entry because the current
directory is not inside /sys/fs/cgroup. The absolute path to the existing
entry must be used instead.
2016-01-05 12:49:48 +01:00
Thomas Hindoe Paaboel Andersen cf0fbc49e6 tree-wide: sort includes
Sort the includes accoding to the new coding style.
2015-11-16 22:09:36 +01:00
Lennart Poettering b5efdb8af4 util-lib: split out allocation calls into alloc-util.[ch] 2015-10-27 13:45:53 +01:00
Lennart Poettering ee104e11e3 user-util: move UID/GID related macros from macro.h to user-util.h 2015-10-27 13:25:57 +01:00
Lennart Poettering 4349cd7c1d util-lib: move mount related utility calls to mount-util.[ch] 2015-10-27 13:25:55 +01:00
David Herrmann 7ff307bc4c mount: propagate error codes correctly
Make sure to propagate error codes from mount-loops correctly. Right now,
we return the return-code of the first mount that did _something_. This is
not what we want. Make sure we return an error if _any_ mount fails (and
then make sure to return the first error to not hide proper errors due to
consequential errors like -ENOTDIR).

Reported by cee1 <fykcee1@gmail.com>.
2015-09-21 20:03:24 +02:00
Sangjung Woo f8c1a81c51 smack: bugfix the smack label of symlink when '--with-smack-run-label' is set
Even though systemd has its own smack label since
'--with-smack-run-label' configuration is set, the smack label of each
CGROUP root directory should have the star (i.e. *) label. This is
mainly because current Linux Kernel set the label in this way.
(Refer to smack_d_instantiate() in security/smack/smack_lsm.c)

However, if systemd has its own smack label and arg_join_controllers is
explicitly set or initialized by initialize_join_controllers() function,
current systemd creates the symlink in CGROUP root directory with its
own smack label as below.

lrwxrwxrwx. 1 root root System  11 Dec 31 16:00 cpu -> cpu,cpuacct
dr-xr-xr-x. 4 root root *        0 Dec 31 16:01 cpu,cpuacct
lrwxrwxrwx. 1 root root System  11 Dec 31 16:00 cpuacct -> cpu,cpuacct

This patch fixes that bug by copying the smack label from the origin.
2015-09-09 20:26:52 +09:00