Commit graph

144 commits

Author SHA1 Message Date
Frantisek Sumsal 69e3234db7 tree-wide: fix typos found by codespell
Reported by Fossies.org
2020-09-14 15:32:37 +02:00
Lennart Poettering bbb4e7f39f core: hide /run/credentials whenever namespacing is requested
Ideally we would like to hide all other service's credentials for all
services. That would imply for us to enable mount namespacing for all
services, which is something we cannot do, both due to compatibility
with the status quo ante, and because a number of services legitimately
should be able to install mounts in the host hierarchy.

Hence we do the second best thing, we hide the credentials automatically
for all services that opt into mount namespacing otherwise. This is
quite different from other mount sandboxing options: usually you have to
explicitly opt into each. However, given that the credentials logic is a
brand new concept we invented right here and now, and particularly
security sensitive it's OK to reverse this, and by default hide
credentials whenever we can (i.e. whenever mount namespacing is
otherwise opt-ed in to).

Long story short: if you want to hide other service's credentials, the
most basic options is to just turn on PrivateMounts= and there you go,
they should all be gone.
2020-08-25 19:45:38 +02:00
Lennart Poettering 5b14956385
Merge pull request #16543 from poettering/nspawn-run-host
nspawn: /run/host/ tweaks
2020-08-20 16:20:05 +02:00
Zbigniew Jędrzejewski-Szmek ec673ad4ab
Merge pull request #16559 from benzea/benzea/memory-recursiveprot
mount-setup: Enable memory_recursiveprot for cgroup2
2020-08-20 13:05:07 +02:00
Lennart Poettering 9fac502920 nspawn,pid1: pass "inaccessible" nodes from cntr mgr to pid1 payload via /run/host
Let's make /run/host the sole place we pass stuff from host to container
in and place the "inaccessible" nodes in /run/host too.

In contrast to the previous two commits this is a minor compat break, but
not a relevant one I think. Previously the container manager would place
these nodes in /run/systemd/inaccessible/ and that's where PID 1 in the
container would try to add them too when missing. Container manager and
PID 1 in the container would thus manage the same dir together.

With this change the container manager now passes an immutable directory
to the container and leaves /run/systemd entirely untouched, and managed
exclusively by PID 1 inside the container, which is nice to have clear
separation on who manages what.

In order to make sure systemd then usses the /run/host/inaccesible/
nodes this commit changes PID 1 to look for that dir and if it exists
will symlink it to /run/systemd/inaccessible.

Now, this will work fine if new nspawn and new pid 1 in the container
work together. as then the symlink is created and the difference between
the two dirs won't matter.

For the case where an old nspawn invokes a new PID 1: in this case
things work as they always worked: the dir is managed together.

For the case where different container manager invokes a new PID 1: in
this case the nodes aren't typically passed in, and PID 1 in the
container will try to create them and will likely fail partially (though
gracefully) when trying to create char/block device nodes. THis is fine
though as there are fallbacks in place for that case.

For the case where a new nspawn invokes an old PID1: this is were the
(minor) incompatibily happens: in this case new nspawn will place the
nodes in the /run/host/inaccessible/ subdir, but the PID 1 in the
container won't look for them there. Since the nodes are also not
pre-created in /run/systed/inaccessible/ PID 1 will try to create them
there as if a different container manager sets them up. This is of
course not sexy, but is not a total loss, since as mentioned fallbacks
are in place anyway. Hence I think it's OK to accept this minor
incompatibility.
2020-08-20 10:17:52 +02:00
Lennart Poettering 4428c49db9 mount-setup: drop pointless zero initialization 2020-08-19 18:11:00 +02:00
Benjamin Berg 56f47800d8 mount-setup: Enable memory_recursiveprot for cgroup2
When available, enable memory_recursiveprot. Realistically it always
makes sense to delegate MemoryLow= and MemoryMin= to all children of a
slice/unit.

The kernel option is not enabled by default as it might cause
regressions in some setups. However, it is the better default in
general, and it results in a more flexible and obvious behaviour.

The alternative to using this option would be for user's to also set
DefaultMemoryLow= on slices when assigning MemoryLow=. However, this
makes the effect of MemoryLow= on some children less obvious, as it
could result in a lower protection rather than increasing it.

From the kernel documentation:

  memory_recursiveprot

        Recursively apply memory.min and memory.low protection to
        entire subtrees, without requiring explicit downward
        propagation into leaf cgroups.  This allows protecting entire
        subtrees from one another, while retaining free competition
        within those subtrees.  This should have been the default
        behavior but is a mount-option to avoid regressing setups
        relying on the original semantics (e.g. specifying bogusly
        high 'bypass' protection values at higher tree levels).

This was added in kernel commit 8a931f801340c (mm: memcontrol:
recursive memory.low protection), which became available in 5.7 and was
subsequently fixed in kernel 5.7.7 (mm: memcontrol: handle div0 crash
race condition in memory.low).
2020-08-19 11:17:01 +02:00
Zbigniew Jędrzejewski-Szmek b67ec8e5b2 pid1: stop limiting size of /dev/shm
The explicit limit is dropped, which means that we return to the kernel default
of 50% of RAM. See 362a55fc14 for a discussion why that is not as much as it
seems. It turns out various applications need more space in /dev/shm and we
would break them by imposing a low limit.

While at it, rename the define and use a single macro for various tmpfs mounts.
We don't really care what the purpose of the given tmpfs is, so it seems
reasonable to use a single macro.

This effectively reverts part of 7d85383edb. Fixes #16617.
2020-07-30 18:48:35 +02:00
Lennart Poettering 48b747fa03 inaccessible: move inaccessible file nodes to /systemd/ subdir in runtime dir always
Let's make sure $XDG_RUNTIME_DIR for the user instance and /run for the
system instance is always organized the same way: the "inaccessible"
device nodes should be placed in a subdir of either called "systemd" and
a subdir of that called "inaccessible".

This way we can emphasize the common behaviour, and only differ where
really necessary.

Follow-up for #13823
2020-06-09 16:23:56 +02:00
Topi Miettinen 7d85383edb tree-wide: add size limits for tmpfs mounts
Limit size of various tmpfs mounts to 10% of RAM, except volatile root and /var
to 25%. Another exception is made for /dev (also /devs for PrivateDevices) and
/sys/fs/cgroup since no (or very few) regular files are expected to be used.

In addition, since directories, symbolic links, device specials and xattrs are
not counted towards the size= limit, number of inodes is also limited
correspondingly: 4MB size translates to 1k of inodes (assuming 4k each), 10% of
RAM (using 16GB of RAM as baseline) translates to 400k and 25% to 1M inodes.

Because nr_inodes option can't use ratios like size option, there's an
unfortunate side effect that with small memory systems the limit may be on the
too large side. Also, on an extremely small device with only 256MB of RAM, 10%
of RAM for /run may not be enough for re-exec of PID1 because 16MB of free
space is required.
2020-05-13 00:37:18 +02:00
Wen Yang f74349d88b mount-setup: change the system mount propagation to shared by default only at bootup
The commit b3ac5f8cb9 has changed the system mount propagation to
shared by default, and according to the following patch:
https://github.com/opencontainers/runc/pull/208
When starting the container, the pouch daemon will call runc to execute
make-private.

However, if the systemctl daemon-reexec is executed after the container
has been started, the system mount propagation will be changed to share
again by default, and the make-private operation above will have no chance
to execute.
2020-04-09 10:14:20 +02:00
Topi Miettinen 3b5b6826aa mount-setup: make /dev noexec
/dev used to be mounted with "exec" flag due to /dev/MAKEDEV script but that's
history and it's now located in /sbin. mmap() with file descriptor to
"/dev/zero" (instead of modern mmap(,,,MAP_ANON...))  will still work.
2020-03-09 19:08:42 +01:00
Anita Zhang e5f10cafe0 core: create inaccessible nodes for users when making runtime dirs
To support ProtectHome=y in a user namespace (which mounts the inaccessible
nodes), the nodes need to be accessible by the user. Create these paths and
devices in the user runtime directory so they can be used later if needed.
2019-12-18 11:09:30 -08:00
Yu Watanabe f5947a5e92 tree-wide: drop missing.h 2019-10-31 17:57:03 +09:00
Zbigniew Jędrzejewski-Szmek 86e94d95d0
Merge pull request #13246 from keszybz/add-SystemdOptions-efi-variable
Add efi variable to augment /proc/cmdline
2019-10-03 12:19:44 +02:00
Zbigniew Jędrzejewski-Szmek 90b059b608 pid1: do not warn if /run/systemd/relabel-extra.d/ doesn't exist
After all, that is the expected state.
2019-09-19 18:01:40 +02:00
Zbigniew Jędrzejewski-Szmek 0bb2f0f10e util-lib: split shared/efivars into basic/efivars and shared/efi-loader
I want to use efivars.[ch] in proc-cmdline.c, but most of the efivars stuff is
not needed in basic/. Move the file from shared/ to basic/, but then move back
most of the higher-level functions to the new shared/efi-loader.c file.
2019-09-16 18:08:53 +02:00
Zbigniew Jędrzejewski-Szmek fdb3decaa7 util-lib: move some functions from basic/cgroup-util to shared/cgroup-setup
This way less stuff needs to be in basic. Initially, I wanted to move all the
parts of cgroup-utils.[ch] that depend on efivars.[ch] to shared, because
efivars.[ch] is in shared/. Later on, I decide to split efivars.[ch], so the
move done in this patch is not necessary anymore. Nevertheless, it is still
valid on its own. If at some point we want to expose libbasic, it is better to
to not have stuff that belong in libshared there.
2019-09-16 18:08:00 +02:00
Yu Watanabe f39fc2d88b
Merge pull request #13354 from keszybz/two-refactoring-patches
Two or more refactoring patches
2019-09-16 21:24:13 +09:00
Zbigniew Jędrzejewski-Szmek 36b12282e1 basic/conf-files: make conf_files_list() take just a single directory
This function had two users (apart from tests), and both only used one
argument. And it seems likely that if we need to pass more directories,
either the _nulstr() or the _strv() form would be used. Let's simplify
the code.
2019-09-16 09:15:05 +02:00
Zbigniew Jędrzejewski-Szmek 48da02ec6f core/mount-setup: use conf_files_list_strv() for relabel-extra.d/ 2019-09-16 09:15:05 +02:00
Benjamin Gilbert 71de68476c mount-setup: relabel items mentioned directly in relabel-extra.d
relabel_extra() relabels the descendants of directories listed in
relabel-extra.d, but doesn't relabel the files or directories
explicitly named there.  This makes it impossible to use
relabel-extra.d to relabel the root of a filesystem.  Fix by
relabeling the named items too.
2019-09-16 09:04:22 +02:00
Lennart Poettering b910cc72c0 tree-wide: get rid of strappend()
It's a special case of strjoin(), so no need to keep both. In particular
as typing strjoin() is even shoert than strappend().
2019-07-12 14:31:12 +09:00
Lennart Poettering d8b4d14df4 util: split out nulstr related stuff to nulstr-util.[ch] 2019-03-14 13:25:52 +01:00
Lennart Poettering 70a74ec645 mount-setup: don't consider it reason to fail if we can't relabel cgroupfs
We usually don't care much about relabel failures, let's not do that
here either.
2018-12-12 20:46:07 +01:00
Lennart Poettering c4217b43d1 mount-setup: use FOREACH_STRING where appropriate 2018-12-12 20:46:07 +01:00
Lennart Poettering 65e183d789 mount-setup: optionally, relabel a configured set of files/dirs after loading policy
Fixes: #10466
2018-12-12 20:46:07 +01:00
Zbigniew Jędrzejewski-Szmek b2ac2b01c8
Merge pull request #10996 from poettering/oci-prep
Preparation for the nspawn-OCI work
2018-11-30 10:09:00 +01:00
Zbigniew Jędrzejewski-Szmek 049af8ad0c Split out part of mount-util.c into mountpoint-util.c
The idea is that anything which is related to actually manipulating mounts is
in mount-util.c, but functions for mountpoint introspection are moved to the
new file. Anything which requires libmount must be in mount-util.c.

This was supposed to be a preparation for further changes, with no functional
difference, but it results in a significant change in linkage:

$ ldd build/libnss_*.so.2
(before)
build/libnss_myhostname.so.2:
	linux-vdso.so.1 (0x00007fff77bf5000)
	librt.so.1 => /lib64/librt.so.1 (0x00007f4bbb7b2000)
	libmount.so.1 => /lib64/libmount.so.1 (0x00007f4bbb755000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f4bbb734000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f4bbb56e000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f4bbb8c1000)
	libblkid.so.1 => /lib64/libblkid.so.1 (0x00007f4bbb51b000)
	libuuid.so.1 => /lib64/libuuid.so.1 (0x00007f4bbb512000)
	libselinux.so.1 => /lib64/libselinux.so.1 (0x00007f4bbb4e3000)
	libpcre2-8.so.0 => /lib64/libpcre2-8.so.0 (0x00007f4bbb45e000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007f4bbb458000)
build/libnss_mymachines.so.2:
	linux-vdso.so.1 (0x00007ffc19cc0000)
	librt.so.1 => /lib64/librt.so.1 (0x00007fdecb74b000)
	libcap.so.2 => /lib64/libcap.so.2 (0x00007fdecb744000)
	libmount.so.1 => /lib64/libmount.so.1 (0x00007fdecb6e7000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fdecb6c6000)
	libc.so.6 => /lib64/libc.so.6 (0x00007fdecb500000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fdecb8a9000)
	libblkid.so.1 => /lib64/libblkid.so.1 (0x00007fdecb4ad000)
	libuuid.so.1 => /lib64/libuuid.so.1 (0x00007fdecb4a2000)
	libselinux.so.1 => /lib64/libselinux.so.1 (0x00007fdecb475000)
	libpcre2-8.so.0 => /lib64/libpcre2-8.so.0 (0x00007fdecb3f0000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007fdecb3ea000)
build/libnss_resolve.so.2:
	linux-vdso.so.1 (0x00007ffe8ef8e000)
	librt.so.1 => /lib64/librt.so.1 (0x00007fcf314bd000)
	libcap.so.2 => /lib64/libcap.so.2 (0x00007fcf314b6000)
	libmount.so.1 => /lib64/libmount.so.1 (0x00007fcf31459000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fcf31438000)
	libc.so.6 => /lib64/libc.so.6 (0x00007fcf31272000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fcf31615000)
	libblkid.so.1 => /lib64/libblkid.so.1 (0x00007fcf3121f000)
	libuuid.so.1 => /lib64/libuuid.so.1 (0x00007fcf31214000)
	libselinux.so.1 => /lib64/libselinux.so.1 (0x00007fcf311e7000)
	libpcre2-8.so.0 => /lib64/libpcre2-8.so.0 (0x00007fcf31162000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007fcf3115c000)
build/libnss_systemd.so.2:
	linux-vdso.so.1 (0x00007ffda6d17000)
	librt.so.1 => /lib64/librt.so.1 (0x00007f610b83c000)
	libcap.so.2 => /lib64/libcap.so.2 (0x00007f610b835000)
	libmount.so.1 => /lib64/libmount.so.1 (0x00007f610b7d8000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f610b7b7000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f610b5f1000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f610b995000)
	libblkid.so.1 => /lib64/libblkid.so.1 (0x00007f610b59e000)
	libuuid.so.1 => /lib64/libuuid.so.1 (0x00007f610b593000)
	libselinux.so.1 => /lib64/libselinux.so.1 (0x00007f610b566000)
	libpcre2-8.so.0 => /lib64/libpcre2-8.so.0 (0x00007f610b4e1000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f610b4db000)

(after)
build/libnss_myhostname.so.2:
	linux-vdso.so.1 (0x00007fff0b5e2000)
	librt.so.1 => /lib64/librt.so.1 (0x00007fde0c328000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fde0c307000)
	libc.so.6 => /lib64/libc.so.6 (0x00007fde0c141000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fde0c435000)
build/libnss_mymachines.so.2:
	linux-vdso.so.1 (0x00007ffdc30a7000)
	librt.so.1 => /lib64/librt.so.1 (0x00007f06ecabb000)
	libcap.so.2 => /lib64/libcap.so.2 (0x00007f06ecab4000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f06eca93000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f06ec8cd000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f06ecc15000)
build/libnss_resolve.so.2:
	linux-vdso.so.1 (0x00007ffe95747000)
	librt.so.1 => /lib64/librt.so.1 (0x00007fa56a80f000)
	libcap.so.2 => /lib64/libcap.so.2 (0x00007fa56a808000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fa56a7e7000)
	libc.so.6 => /lib64/libc.so.6 (0x00007fa56a621000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fa56a964000)
build/libnss_systemd.so.2:
	linux-vdso.so.1 (0x00007ffe67b51000)
	librt.so.1 => /lib64/librt.so.1 (0x00007ffb32113000)
	libcap.so.2 => /lib64/libcap.so.2 (0x00007ffb3210c000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007ffb320eb000)
	libc.so.6 => /lib64/libc.so.6 (0x00007ffb31f25000)
	/lib64/ld-linux-x86-64.so.2 (0x00007ffb3226a000)

I don't quite understand what is going on here, but let's not be too picky.
2018-11-29 21:03:44 +01:00
Lennart Poettering 30874dda3a dev-setup: generalize logic we use to create "inaccessible" device nodes
Let's generalize this, so that we can use this in nspawn later on, which
is pretty useful as we need to be able to mask files from the inner
child of nspawn too, where the host's /run/systemd/inaccessible
directory is not visible anymore. Moreover, if nspawn can create these
nodes on its own before the payload this means the payload can run with
fewer privileges.
2018-11-29 20:21:40 +01:00
Lennart Poettering 143fadf369 core: remove JoinControllers= configuration setting
This removes the ability to configure which cgroup controllers to mount
together. Instead, we'll now hardcode that "cpu" and "cpuacct" are
mounted together as well as "net_cls" and "net_prio".

The concept of mounting controllers together has no future as it does
not exist to cgroupsv2. Moreover, the current logic is systematically
broken, as revealed by the discussions in #10507. Also, we surveyed Red
Hat customers and couldn't find a single user of the concept (which
isn't particularly surprising, as it is broken...)

This reduced the (already way too complex) cgroup handling for us, since
we now know whenever we make a change to a cgroup for one controller to
which other controllers it applies.
2018-11-16 14:54:13 +01:00
Lennart Poettering 0c69794138 tree-wide: remove Lennart's copyright lines
These lines are generally out-of-date, incomplete and unnecessary. With
SPDX and git repository much more accurate and fine grained information
about licensing and authorship is available, hence let's drop the
per-file copyright notice. Of course, removing copyright lines of others
is problematic, hence this commit only removes my own lines and leaves
all others untouched. It might be nicer if sooner or later those could
go away too, making git the only and accurate source of authorship
information.
2018-06-14 10:20:20 +02:00
Lennart Poettering 818bf54632 tree-wide: drop 'This file is part of systemd' blurb
This part of the copyright blurb stems from the GPL use recommendations:

https://www.gnu.org/licenses/gpl-howto.en.html

The concept appears to originate in times where version control was per
file, instead of per tree, and was a way to glue the files together.
Ultimately, we nowadays don't live in that world anymore, and this
information is entirely useless anyway, as people are very welcome to
copy these files into any projects they like, and they shouldn't have to
change bits that are part of our copyright header for that.

hence, let's just get rid of this old cruft, and shorten our codebase a
bit.
2018-06-14 10:20:20 +02:00
Lennart Poettering ef31828d06 tree-wide: unify how we define bit mak enums
Let's always write "1 << 0", "1 << 1" and so on, except where we need
more than 31 flag bits, where we write "UINT64(1) << 0", and so on to force
64bit values.
2018-06-12 21:44:00 +02:00
Zbigniew Jędrzejewski-Szmek 6978efcffb core/mount-setup: remove part of check which is always true
f1470e424b removed one check, but missed a similar
one a few lines down.

CID #1390949.
2018-05-14 08:50:00 +02:00
Zbigniew Jędrzejewski-Szmek f1470e424b core/mount-setup: remove part of check which is always true
k was set to join_controllers at this point and only incremented, so
it cannot be null at this point.

CID #1390949.
2018-05-10 02:03:23 +02:00
Lennart Poettering fe80fcc7e8 mount-setup: add a comment that the character/block device nodes are "optional" (#8893)
if we lack privs to create device nodes that's fine, and creating
/run/systemd/inaccessible/chr or /run/systemd/inaccessible/blk won't
work then. Document this in longer comments.

Fixes: #4484
2018-05-03 23:10:35 +09:00
Zbigniew Jędrzejewski-Szmek 11a1589223 tree-wide: drop license boilerplate
Files which are installed as-is (any .service and other unit files, .conf
files, .policy files, etc), are left as is. My assumption is that SPDX
identifiers are not yet that well known, so it's better to retain the
extended header to avoid any doubt.

I also kept any copyright lines. We can probably remove them, but it'd nice to
obtain explicit acks from all involved authors before doing that.
2018-04-06 18:58:55 +02:00
Lennart Poettering 771b7ead84 machine-image,mount-setup: minor coding style fixes 2018-03-28 22:04:58 +02:00
Krzysztof Nowicki 6f7729c176 core: dont't remount /sys/fs/cgroup for relabel if not needed (#8595)
The initial fix for relabelling the cgroup filesystem for
SELinux delivered in commit 8739f23e3 was based on the assumption that
the cgroup filesystem is already populated once mount_setup() is
executed, which was true for my system. What I wasn't aware is that this
is the case only when another instance of systemd was running before
this one, which can happen if systemd is used in the initrd (for ex. by
dracut).

In case of a clean systemd start-up the cgroup filesystem is actually
being populated after mount_setup() and does not need relabelling as at
that moment the SELinux policy is already loaded. Since however the root
cgroup filesystem was remounted read-only in the meantime this operation
will now fail.

To fix this check for the filesystem mount flags before relabelling and
only remount ro->rw->ro if necessary and leave the filesystem read-write
otherwise.

Fixes #7901.
2018-03-28 13:36:33 +02:00
Lennart Poettering 08c849815c label: rework label_fix() implementations (#8583)
This reworks the SELinux and SMACK label fixing calls in a number of
ways:

1. The two separate boolean arguments of these functions are converted
   into a flags type LabelFixFlags.

2. The operations are now implemented based on O_PATH. This should
   resolve TTOCTTOU races between determining the label for the file
   system object and applying it, as it it allows to pin the object
   while we are operating on it.

3. When changing a label fails we'll query the label previously set, and
   if matches what we want to set anyway we'll suppress the error.

Also, all calls to label_fix() are now (void)ified, when we ignore the
return values.

Fixes: #8566
2018-03-27 07:38:26 +02:00
Lennart Poettering ae2a15bc14 macro: introduce TAKE_PTR() macro
This macro will read a pointer of any type, return it, and set the
pointer to NULL. This is useful as an explicit concept of passing
ownership of a memory area between pointers.

This takes inspiration from Rust:

https://doc.rust-lang.org/std/option/enum.Option.html#method.take

and was suggested by Alan Jenkins (@sourcejedi).

It drops ~160 lines of code from our codebase, which makes me like it.
Also, I think it clarifies passing of ownership, and thus helps
readability a bit (at least for the initiated who know the new macro)
2018-03-22 20:21:42 +01:00
Yu Watanabe 5cbaad2f67 core: do not free heap-allocated strings (#8391)
Fixes #8387.
2018-03-08 14:21:54 +01:00
Lennart Poettering 39f305a901 mount-setup: change bpf mount mode to 0700 (#8334)
After discussing with the kernel folks, we agreed to default to 0700 for
this. Better safe than sorry.
2018-03-02 12:55:24 +01:00
Lennart Poettering 6590080851 mount-setup: always use the same source as fstype for the API VFS we mount
So far, for all our API VFS mounts we used the fstype also as mount
source, let's do that for the cgroupsv2 mounts too. The kernel doesn't
really care about the source for API VFS, but it's visible to the user,
hence let's clean this up and follow the rule we otherwise follow.
2018-02-21 16:43:36 +01:00
Lennart Poettering 43b7f24b5e bpf: mount bpffs by default on boot
We make heavy use of BPF functionality these days, hence expose the BPF
file system too by default now. (Note however, that we don't actually
make use bpf file systems object yet, but we might later on too.)
2018-02-21 16:43:36 +01:00
Zbigniew Jędrzejewski-Szmek 56c8d7444a pid1: do not initialize join_controllers by default
We're moving towards unified cgroup hierarchy where this is not necessary.
This makes main.c a bit simpler.
2018-02-19 15:18:54 +01:00
Lennart Poettering 713a88757a mount-setup: fix MNT_CHECK_WRITABLE error handling, and log about the issue
Let's correct the error handling (the error is in errno, not r), and
let's add logging like the rest of the function has it.
2017-12-15 20:52:28 +01:00
Krzysztof Nowicki 8739f23e3c Fix SELinux labels in cgroup filesystem root directory (#7496)
When using SELinux with legacy cgroups the tmpfs on /sys/fs/cgroup is by
default labelled as tmpfs_t. This label is also inherited by the "cpu"
and "cpuacct" symbolic links. Unfortunately the policy expects them to
be labelled as cgroup_t, which is used for all the actual cgroup
filesystems. Failure to do so results in a stream of denials.

This state cannot be fixed reliably when the cgroup filesystem structure
is set-up as the SELinux policy is not yet loaded at this
moment. It also cannot be fixed later as the root of the cgroup
filesystem is remounted read-only. In order to fix it the root of the
cgroup filesystem needs to be temporary remounted read-write, relabelled
and remounted back read-only.
2017-11-30 11:59:29 +01:00
Christian Brauner 1ff654e28b core: remove empty cgroups (#7457)
When we skip an unwritable cgroup also remove the empty mountpoint.
2017-11-24 21:05:16 +01:00