Commit graph

276 commits

Author SHA1 Message Date
Topi Miettinen 07ce74074d namespace: avoid assertion failure (#6649)
If the root image is not decrypted, it must not be relinquished.
2017-08-29 17:31:24 +02:00
Nicolas Iooss 3a0bf6d6aa namespace: keep selinuxfs mounted read-write with ProtectKernelTunables (#5741)
When a service unit uses "ProtectKernelTunables=yes", it currently
remounts /sys/fs/selinux read-only. This makes libselinux report SELinux
state as "disabled", because most SELinux features are not usable. For
example it is not possible to validate security contexts (with
security_check_context_raw() or /sys/fs/selinux/context). This behavior
of libselinux has been described in
http://danwalsh.livejournal.com/73099.html and confirmed in a recent
email, https://marc.info/?l=selinux&m=149220233032594&w=2 .

Since commit 0c28d51ac8 ("units: further lock down our long-running
services"), systemd-localed unit uses ProtectKernelTunables=yes.
Nevertheless this service needs to use libselinux API in order to create
/etc/vconsole.conf, /etc/locale.conf... with the right SELinux contexts.
This is broken when /sys/fs/selinux is mounted read-only in the mount
namespace of the service.

Make SELinux-aware systemd services work again when they are using
ProtectKernelTunables=yes by keeping selinuxfs mounted read-write.
2017-07-31 17:45:33 +02:00
Timothée Ravier ac9de0b379 core: open /proc/self/mountinfo early to allow mounts over /proc (#5985)
Enable masking the /proc folder using the 'InaccessiblePaths' unit
option.

This also slightly simplifies mount setup, as the bind_remount_recursive
function will only open /proc/self/mountinfo once.

This is based on the suggestion at:
https://lists.freedesktop.org/archives/systemd-devel/2017-April/038634.html
2017-05-19 14:38:40 +02:00
Djalal Harouni 9c988f934b namespace: Apply MountAPIVFS= only when a Root directory is set
The MountAPIVFS= documentation says that this option has no effect
unless used in conjunction with RootDirectory= or RootImage=. Let's fix
this and avoid creating private mount namespaces where they are not
needed.
2017-03-05 21:39:43 +01:00
Djalal Harouni 10404d52e3 namespace: create base-filesystem directories if RootImage= or RootDirectory= are set
When a service is started with its own file system image, always try to
create the base-filesystem directories that are needed. This implicitly
covers the directories handled by MountAPIVFS= {/proc|/sys|/dev}.

Mount protections or MountAPIVFS= mounts were never applied if we
changed the root directory and the related paths were not present under
the new root. The mounts were silently skipped. Fix this by creating those
directories if they are missing.

Closes https://github.com/systemd/systemd/issues/5488
2017-03-05 21:19:29 +01:00
AsciiWolf 13e785f7a0 Fix missing space in comments (#5439) 2017-02-24 18:14:02 +01:00
Lennart Poettering 78ebe98061 core,nspawn,dissect: make nspawn's .roothash file search reusable
This makes nspawn's logic of automatically discovering the root hash of
an image file generic, and then reuses it in systemd-dissect and in
PID1's RootImage= logic, so that verity is automatically set up whenever
we can.
2017-02-07 12:21:28 +01:00
Lennart Poettering 915e6d1676 core: add RootImage= setting for using a specific image file as root directory for a service
This is similar to RootDirectory= but mounts the root file system from a
block device or loopback file instead of another directory.

This reuses the image dissector code now used by nspawn and
gpt-auto-discovery.
2017-02-07 12:19:42 +01:00
Lennart Poettering 5d997827e2 core: add a per-unit setting MountAPIVFS= for mounting /dev, /proc, /sys in conjunction with RootDirectory=
This adds a boolean unit file setting MountAPIVFS=. If set, the three
main API VFS mounts will be mounted for the service. This only has an
effect on RootDirectory=, which it makes a ton more useful.

(This is basically the /dev + /proc + /sys mounting code posted in the
original #4727, but rebased on current git, and with the automatic logic
replaced by explicit logic controlled by a unit file setting)
2017-02-07 11:22:05 +01:00
Lennart Poettering 1eb7e08e20 core: fix minor memleak in namespace.c
The source_malloc field wants to be freed, too.
2017-02-07 11:22:05 +01:00
Lennart Poettering d2d6c096f6 core: add ability to define arbitrary bind mounts for services
This adds two new settings BindPaths= and BindReadOnlyPaths=. They allow
defining arbitrary bind mounts specific to particular services. This is
particularly useful for services with RootDirectory= set as this permits making
specific bits of the host directory available to chrooted services.

The two new settings follow the concepts nspawn already possesses in --bind= and
--bind-ro=, as well as the .nspawn settings Bind= and BindReadOnly= (and these
latter options should probably be renamed to BindPaths= and BindReadOnlyPaths=
too).

Fixes: #3439
2016-12-14 00:54:10 +01:00
Lennart Poettering 8fceda937f namespace: instead of chasing mount symlinks a priori, do so as-we-go
This is relevant as many of the mounts we try to establish only can be followed
when some other prior mount that is a prefix of it is established. Hence: move
the symlink chasing into the actual mount functions, so that we do it as late
as possible but as early as necessary.

Fixes: #4588
2016-12-14 00:51:37 +01:00
Lennart Poettering 34de407a4f core: rename BindMount structure → MountEntry
After all, these don't strictly encapsulate bind mounts anymore, and we are
preparing this for adding arbitrary user-defined bind mounts in a later commit,
at which point this would become really confusing. Let's clean this up, rename
the BindMount structure to MountEntry, so that it is clear that it can contain
information about any kind of mount.
2016-12-14 00:48:52 +01:00
Lennart Poettering cfbeb4ef8d namespace: add explicit read-only flag
This reworks handling of the read-only management for mount points. This will
become handy as soon as we add arbitrary bind mount support (which comes in a
later commit).
2016-12-14 00:42:01 +01:00
Lennart Poettering ddbe041277 namespace: reindent protect_system_strict_table[] as well
All other tables got reindented, but one was forgotten. Fix that.
2016-12-13 21:22:13 +01:00
Lennart Poettering c4f4fce79e fs-util: add flags parameter to chase_symlinks()
Let's remove chase_symlinks_prefix() and instead introduce a flags parameter to
chase_symlinks(), with a flag CHASE_PREFIX_ROOT that exposes the behaviour of
chase_symlinks_prefix().
2016-12-01 00:25:51 +01:00
Lennart Poettering e187369587 tree-wide: stop using canonicalize_file_name(), use chase_symlinks() instead
Let's use chase_symlinks() everywhere, and stop using GNU
canonicalize_file_name() everywhere. For most cases this should not change
behaviour, but it does increase exposure of our function so that it gets better tested. Most
importantly in a few cases (most notably nspawn) it can take the correct root
directory into account when chasing symlinks.
2016-12-01 00:25:51 +01:00
Lennart Poettering aa70f38b5c namespace: clarify that /proc/apm is obsolete, but leave it blocked 2016-11-17 18:10:30 +01:00
Lennart Poettering c6232fb0e9 namespace: reindent namespace tables
Let's align all our BindMount tables, let's use the same column widths in all
of them, and let's make them not any wider than necessary.

This only changes whitespace, not contents of any of the tables.
2016-11-17 18:09:16 +01:00
Lennart Poettering 5327c910d2 namespace: simplify, optimize and extend handling of mounts for namespace
This changes a couple of things in the namespace handling:

It merges the BindMount and TargetMount structures. They are mostly the same,
hence let's just use the same structure, and rely on C's implicit zero
initialization of partially initialized structures for the unneeded fields.

This reworks memory management of each entry a bit. It now contains one "const"
and one "malloc" path. We use the former whenever we can, but use the latter
when we have to, which is the case when we have to chase symlinks or prefix a
root directory. This means in the common case we don't actually need to
allocate any dynamic memory. To make this easy to use we add an accessor
function bind_mount_path() which retrieves the right path string from a
BindMount structure.

While we are at it, also permit "+" as prefix for dirs configured with
ReadOnlyPaths= and friends: if specified the root directory of the unit is
implicitly prefixed.

This also drops set_bind_mount() and uses C99 structure initialization instead,
which I think is more readable and clarifies what is being done.

This drops append_protect_kernel_tunables() and
append_protect_kernel_modules() as append_static_mounts() is now simple enough
to be called directly.

Prefixing with the root dir is now done in an explicit step in
prefix_where_needed(). It will prepend the root directory on each entry that
doesn't have it prefixed yet. The latter is determined depending on an extra
bit in the BindMount structure.
2016-11-17 18:08:32 +01:00
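The "const"/"malloc" path split and the explicit prefixing step described in 5327c910d2 can be sketched roughly as below. This is a simplified illustration, not the code from namespace.c: the type SketchMount and the functions sketch_mount_path() and sketch_prefix_root() are hypothetical names, and the sketch assumes the root directory is given without a trailing slash.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the merged mount structure. */
typedef struct SketchMount {
        const char *path_const; /* static string from a table, never freed */
        char *path_malloc;      /* heap copy, only set once the path had to be rewritten */
        bool has_prefix;        /* true if the root directory is already prepended */
} SketchMount;

/* Accessor in the spirit of bind_mount_path(): prefer the heap copy if there is one. */
const char *sketch_mount_path(const SketchMount *m) {
        assert(m);
        return m->path_malloc ? m->path_malloc : m->path_const;
}

/* Prepend root to the entry unless it is already prefixed. Returns 0 or -ENOMEM. */
int sketch_prefix_root(SketchMount *m, const char *root) {
        const char *old;
        size_t size;
        char *s;

        assert(m);
        assert(root);

        if (m->has_prefix)
                return 0;

        old = sketch_mount_path(m);
        size = strlen(root) + strlen(old) + 1;
        s = malloc(size);
        if (!s)
                return -ENOMEM;
        snprintf(s, size, "%s%s", root, old);

        free(m->path_malloc); /* old may point here, so free only after copying */
        m->path_malloc = s;
        m->has_prefix = true;
        return 0;
}
```

An entry written as `SketchMount m = { .path_const = "/proc/sys" };` relies on C99 initialization zeroing the remaining fields, so no dynamic memory is touched until a prefix is actually needed.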
Djalal Harouni 1d54cd5d25 core:namespace: count and free failed paths inside chase_all_symlinks() (#4619)
This certainly fixes a bug that was introduced by PR
https://github.com/systemd/systemd/pull/4594 that intended to fix
https://github.com/systemd/systemd/issues/4567.

The fix was not complete. This patch makes sure that we count and free
all paths that fail inside chase_all_symlinks().

Fixes https://github.com/systemd/systemd/issues/4567
2016-11-10 12:11:37 -05:00
Djalal Harouni af964954c6 core: on DynamicUser= make sure that protecting sensitive paths is enforced (#4596)
This adds a variable that is always set to false to make sure that
protect paths inside sandbox are always enforced and not ignored. The only
case when it is set to true is on DynamicUser=no and RootDirectory=/chroot
is set. This allows users to use more of our sandbox features inside RootDirectory=

The only exception is ProtectSystem=full|strict and when DynamicUser=yes
is implied. Currently RootDirectory= is not fully compatible with these
due to two reasons:

* /chroot/usr|etc has to be present on ProtectSystem=full
* /chroot// has to be a mount point on ProtectSystem=strict.
2016-11-08 21:57:32 -05:00
Zbigniew Jędrzejewski-Szmek 46c3230dd0 nspawn: slight simplification 2016-11-07 08:57:30 -05:00
Zbigniew Jędrzejewski-Szmek 49fedb4094 nspawn: avoid one strdup by using free_and_replace 2016-11-07 08:54:47 -05:00
Djalal Harouni f0a4feb0a5 core: make RootDirectory= and ProtectKernelModules= work
Instead of having two fields inside BindMount struct where one is stack
based and the other one is heap, use one field to store the full path
and update it when we chase symlinks. This way we avoid dealing with
both at the same time.

This makes RootDirectory= work with ProtectHome= and ProtectKernelModules=yes

Fixes: https://github.com/systemd/systemd/issues/4567
2016-11-07 12:34:52 +01:00
Zbigniew Jędrzejewski-Szmek 605405c6cc tree-wide: drop NULL sentinel from strjoin
This makes strjoin and strjoina more similar and avoids the useless final
argument.

spatch -I . -I ./src -I ./src/basic -I ./src/basic -I ./src/shared -I ./src/shared -I ./src/network -I ./src/locale -I ./src/login -I ./src/journal -I ./src/journal -I ./src/timedate -I ./src/timesync -I ./src/nspawn -I ./src/resolve -I ./src/resolve -I ./src/systemd -I ./src/core -I ./src/core -I ./src/libudev -I ./src/udev -I ./src/udev/net -I ./src/udev -I ./src/libsystemd/sd-bus -I ./src/libsystemd/sd-event -I ./src/libsystemd/sd-login -I ./src/libsystemd/sd-netlink -I ./src/libsystemd/sd-network -I ./src/libsystemd/sd-hwdb -I ./src/libsystemd/sd-device -I ./src/libsystemd/sd-id128 -I ./src/libsystemd-network --sp-file coccinelle/strjoin.cocci --in-place $(git ls-files src/*.c)

git grep -e '\bstrjoin\b.*NULL' -l|xargs sed -i -r 's/strjoin\((.*), NULL\)/strjoin(\1)/'

This might have missed a few cases (spatch has a really hard time dealing
with _cleanup_ macros), but that's no big issue, they can always be fixed
later.
2016-10-23 11:43:27 -04:00
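One way to drop the explicit NULL sentinel from callers, as 605405c6cc describes, is to let a variadic macro append it. The following is a minimal sketch under that assumption; sketch_strjoin() and sketch_strjoin_real() are illustrative names, and the macro as written needs at least two arguments.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdlib.h>
#include <string.h>

/* Concatenates all arguments up to the terminating NULL. Caller frees the result. */
char *sketch_strjoin_real(const char *first, ...) {
        va_list ap;
        size_t length = 0;
        char *result, *p;

        assert(first);

        /* First pass: add up the lengths. */
        va_start(ap, first);
        for (const char *s = first; s; s = va_arg(ap, const char *))
                length += strlen(s);
        va_end(ap);

        result = malloc(length + 1);
        if (!result)
                return NULL;

        /* Second pass: copy the pieces. */
        p = result;
        va_start(ap, first);
        for (const char *s = first; s; s = va_arg(ap, const char *)) {
                size_t n = strlen(s);
                memcpy(p, s, n);
                p += n;
        }
        va_end(ap);

        *p = '\0';
        return result;
}

/* The macro supplies the sentinel, so call sites no longer spell it out. */
#define sketch_strjoin(a, ...) \
        sketch_strjoin_real((a), __VA_ARGS__, (const char *) NULL)
```

With this in place a call such as `sketch_strjoin("/run", "/systemd")` reads the same as the stack-allocating variant would, which is the similarity the commit is after.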
Djalal Harouni c575770b75 core:sandbox: let's make /lib/modules/ inaccessible on ProtectKernelModules=
Let's go further and make /lib/modules/ inaccessible for services that have
no business with modules. This is a minor improvement, but it may help on
setups with custom modules where they are limited... with regard to the
kernel auto-load feature.

This change introduces the NameSpaceInfo struct, which we may embed later
inside ExecContext, but for now let's just reduce the argument number to
setup_namespace() and merge the ProtectKernelModules feature.
2016-10-12 14:11:16 +02:00
Djalal Harouni b6c432ca7e core:namespace: simplify ProtectHome= implementation
As with previous patch simplify ProtectHome and don't care about
duplicates, they will be sorted by most restrictive mode and cleaned.
2016-09-25 12:41:16 +02:00
Djalal Harouni f471b2afa1 core: simplify ProtectSystem= implementation
ProtectSystem= with all its different modes and other options like
PrivateDevices= + ProtectKernelTunables= + ProtectHome= are orthogonal,
however currently it's a bit hard to parse that from the implementation
view. Simplify it by giving each mode its own table with all paths and
references to other Protect options.

With this change some entries are duplicated, but we do not care since
duplicate mounts are first sorted by the most restrictive mode then
cleaned.
2016-09-25 12:21:25 +02:00
Djalal Harouni 49accde7bd core:sandbox: add more /proc/* entries to ProtectKernelTunables=
Make ALSA entries, latency interface, mtrr, apm/acpi, suspend interface,
filesystems configuration and IRQ tuning readonly.

Most of these interfaces nowadays should be in /sys but they are still
available through /proc, so just protect them. This patch does not touch
/proc/net/...
2016-09-25 11:30:11 +02:00
Djalal Harouni 2652c6c103 core:namespace: simplify mount calculation
Move the mount calculation out into its own function. The logic is already
smart enough to later drop no-op and duplicate mounts; this change
improves code readability.
---
 src/core/namespace.c | 47 ++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 36 insertions(+), 11 deletions(-)
2016-09-25 11:25:00 +02:00
Djalal Harouni 11a30cec2a core:namespace: put paths protected by ProtectKernelTunables= in
Instead of having all these paths everywhere, put the ones that are
protected by ProtectKernelTunables= into their own table. This way it
is easy to add paths and track which ones are protected.
2016-09-25 11:16:44 +02:00
Djalal Harouni 9c94d52e09 core:namespace: minor improvements to append_mounts() 2016-09-25 11:03:21 +02:00
Lennart Poettering cd2902c954 namespace: drop all mounts outside of the new root directory
There's no point in mounting these, if they are outside of the root directory
we'll move to.
2016-09-25 10:52:57 +02:00
Lennart Poettering 8f1ad200f0 namespace: don't make the root directory of a namespace a mount if it already is one
Let's not stack mounts needlessly.
2016-09-25 10:42:18 +02:00
Lennart Poettering d944dc9553 namespace: chase symlinks for mounts to set up in userspace
This adds logic to chase symlinks for all mount points that shall be created in
a namespace environment in userspace, instead of leaving this to the kernel.
This has the advantage that we can correctly handle absolute symlinks that
shall be taken relative to a specific root directory. Moreover, we can properly
handle mounts created on symlinked files or directories as we can merge their
mounts as necessary.

(This also drops the "done" flag in the namespace logic, which was never
actually working, but was supposed to permit a partial rollback of the
namespace logic, which however is only mildly useful as it wasn't clear in
which case it would or would not be able to roll back.)

Fixes: #3867
2016-09-25 10:42:18 +02:00
Lennart Poettering 1e4e94c881 namespace: invoke unshare() only after checking all parameters
Let's create the new namespace only after we validated and processed all
parameters, right before we start with actually mounting things.

This way, the window where we can roll back is larger (not that it matters
IRL...)
2016-09-25 10:42:18 +02:00
Lennart Poettering 3f815163ff core: introduce ProtectSystem=strict
Let's tighten our sandbox a bit more: with this change ProtectSystem= gains a
new setting "strict". If set, the entire directory tree of the system is
mounted read-only, but the API file systems /proc, /dev, /sys are excluded
(they may be managed with PrivateDevices= and ProtectKernelTunables=). Also,
/home and /root are excluded as those are left for ProtectHome= to manage.

In this mode, all "real" file systems (i.e. non-API file systems) are mounted
read-only, and specific directories may only be excluded via
ReadWriteDirectories=, thus implementing an effective whitelist instead of
blacklist of writable directories.

While we are at it, also add /efi to the list of paths always affected by
ProtectSystem=. This is a follow-up for
b52a109ad3 which added /efi as alternative for
/boot. Our namespacing logic should respect that too.
2016-09-25 10:42:18 +02:00
Lennart Poettering 160cfdbed3 namespace: add some debug logging when enforcing InaccessiblePaths= 2016-09-25 10:42:18 +02:00
Lennart Poettering 6b7c9f8bce namespace: rework how ReadWritePaths= is applied
Previously, if ReadWritePaths= was nested inside a ReadOnlyPaths=
specification, then we'd first recursively apply the ReadOnlyPaths= paths, and
make everything below read-only, only in order to then flip the read-only bit
again for the subdirs listed in ReadWritePaths= below it.

This is not only ugly (as for the dirs in question we first turn on the RO bit,
only to turn it off again immediately after), but also problematic in
containers, where a container manager might have marked a set of dirs read-only
and this code will undo this if ReadWritePaths= is set for any.

With this patch behaviour in this regard is altered: ReadOnlyPaths= will not be
applied to the children listed in ReadWritePaths= in the first place, so that
we do not need to turn off the RO bit for those after all.

This means that ReadWritePaths=/ReadOnlyPaths= may only be used to turn on the
RO bit, but never to turn it off again. Or to say this differently: if some
dirs are marked read-only via some external tool, then ReadWritePaths= will not
undo it.

This is not only the safer option, but also more in-line with what the man page
currently claims:

        "Entries (files or directories) listed in ReadWritePaths= are
        accessible from within the namespace with the same access rights as
        from outside."

To implement this change bind_remount_recursive() gained a new "blacklist"
string list parameter, which when passed may contain subdirs that shall be
excluded from the read-only mounting.

A number of functions are updated to add more debug logging to make this more
digestible.
2016-09-25 10:40:51 +02:00
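The "blacklist" idea in 6b7c9f8bce (never apply the read-only bit to subtrees listed in ReadWritePaths=, rather than setting and then clearing it) could look roughly like the check below. This is a sketch, not the actual bind_remount_recursive() code: the function names are hypothetical, and paths are assumed to be absolute and normalized (no trailing or doubled slashes).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if path equals prefix or lies below it; "/foobar" is not below "/foo". */
static bool sketch_path_startswith(const char *path, const char *prefix) {
        size_t n;

        assert(path);
        assert(prefix);

        n = strlen(prefix);
        if (strncmp(path, prefix, n) != 0)
                return false;

        /* n == 1 means the prefix is "/", which contains everything. */
        return n == 1 || path[n] == '\0' || path[n] == '/';
}

/* blacklist is a NULL-terminated string array and may itself be NULL.
 * Returns true if mount_point must be left alone when remounting read-only. */
bool sketch_skip_read_only(const char *mount_point, const char *const *blacklist) {
        assert(mount_point);

        if (!blacklist)
                return false;

        for (const char *const *i = blacklist; *i; i++)
                if (sketch_path_startswith(mount_point, *i))
                        return true;

        return false;
}
```

Because excluded subtrees are skipped instead of being flipped back, a read-only flag set earlier by a container manager stays in place.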
Lennart Poettering 7648a565d1 namespace: when enforcing fs namespace restrictions suppress redundant mounts
If /foo is marked to be read-only, and /foo/bar too, then the latter may be
suppressed as it has no effect.
2016-09-25 10:19:15 +02:00
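The suppression rule from 7648a565d1 can be illustrated with a small filter over a sorted list. The sketch assumes the entries are already sorted so that a directory precedes everything below it, and that paths are absolute and normalized; SketchEntry and sketch_drop_redundant() are hypothetical names, and the real code tracks more modes than a single read-only flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct SketchEntry {
        const char *path;
        bool read_only;
} SketchEntry;

/* True if path equals prefix or lies below it. */
static bool sketch_path_startswith(const char *path, const char *prefix) {
        size_t n = strlen(prefix);

        if (strncmp(path, prefix, n) != 0)
                return false;
        return n == 1 || path[n] == '\0' || path[n] == '/';
}

/* Drops entries whose closest kept ancestor already has the same mode.
 * Compacts the array in place and returns the new number of entries. */
size_t sketch_drop_redundant(SketchEntry *m, size_t n) {
        size_t kept = 0;

        assert(m || n == 0);

        for (size_t i = 0; i < n; i++) {
                bool redundant = false;

                /* Walk backwards: the first prefix found is the closest ancestor. */
                for (size_t j = kept; j > 0; j--)
                        if (sketch_path_startswith(m[i].path, m[j - 1].path)) {
                                redundant = m[j - 1].read_only == m[i].read_only;
                                break;
                        }

                if (!redundant)
                        m[kept++] = m[i];
        }

        return kept;
}
```

With /foo and /foo/bar both read-only, the second entry is dropped; a read-write /foo/baz below the read-only /foo is kept because it changes the effective mode.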
Lennart Poettering 6ee1a919cf namespace: simplify mount_path_compare() a bit 2016-09-25 10:19:10 +02:00
Lennart Poettering fe3c2583be namespace: make sure InaccessibleDirectories= masks all mounts further down
If a dir is marked to be inaccessible then everything below it should be masked
by it.
2016-09-25 10:18:51 +02:00
Lennart Poettering 59eeb84ba6 core: add two new service settings ProtectKernelTunables= and ProtectControlGroups=
If enabled, these will block write access to /sys, /proc/sys and
/proc/sys/fs/cgroup.
2016-09-25 10:18:48 +02:00
Martin Pitt 5c3c778014 Merge pull request #3764 from poettering/assorted-stuff-2
Assorted fixes
2016-07-22 09:10:04 +02:00
Topi Miettinen 176e51b710 namespace: fix wrong return value from mount(2) (#3758)
Fix bug introduced by #3263: mount(2) return value is 0 or -1, not errno.

Thanks to Evgeny Vereshchagin (@evverx) for reporting.
2016-07-20 17:43:21 +03:00
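The bug class fixed in 176e51b710 is easy to reintroduce: mount(2) signals failure with -1 and leaves the reason in errno, so the return value itself must not be propagated as an error code. A minimal Linux-only sketch of the intended pattern follows; sketch_remount_read_only() is a hypothetical helper, not a function from namespace.c.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mount.h>

/* Remounts an existing mount point read-only. Returns 0 or a negative errno value. */
int sketch_remount_read_only(const char *path) {
        /* The flags argument of mount(2) is an unsigned long. */
        unsigned long flags = MS_BIND | MS_REMOUNT | MS_RDONLY;

        assert(path);

        if (mount(NULL, path, NULL, flags, NULL) < 0)
                return -errno; /* the call only returned -1; the cause is in errno */

        return 0;
}
```

Returning the raw result instead would report every failure as -1, which callers expecting negative errno values would read as -EPERM.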
Lennart Poettering fe048ce56a namespace: add a (void) cast 2016-07-20 14:53:15 +02:00
Lennart Poettering 5fd7cf6fe2 namespace: minor improvements
We generally try to avoid strerror(), due to its thread-unsafety, let's do
this here, too.

Also, let's be a tiny bit more explanatory with the log messages, and let's
shorten a few things.
2016-07-20 08:57:25 +02:00
Alessandro Puccetti 2a624c36e6 doc,core: Read{Write,Only}Paths= and InaccessiblePaths=
This patch renames Read{Write,Only}Directories= and InaccessibleDirectories=
to Read{Write,Only}Paths= and InaccessiblePaths=, previous names are kept
as aliases but they are not advertised in the documentation.

Renamed variables:
`read_write_dirs` --> `read_write_paths`
`read_only_dirs` --> `read_only_paths`
`inaccessible_dirs` --> `inaccessible_paths`
2016-07-19 17:22:02 +02:00
Alessandro Puccetti c4b4170746 namespace: unify limit behavior on non-directory paths
Despite the name, `Read{Write,Only}Directories=` already allows for
regular file paths to be masked. This commit adds the same behavior
to `InaccessibleDirectories=` and makes it explicit in the doc.
This patch introduces `/run/systemd/inaccessible/{reg,dir,chr,blk,fifo,sock}`
{file,device} nodes and mounts the appropriate one over each path specified
in `InaccessibleDirectories=`.

Based on Luca's patch from https://github.com/systemd/systemd/pull/3327
2016-07-19 17:22:02 +02:00
topimiettinen 737ba3c82c namespace: Make private /dev noexec and readonly (#3263)
Private /dev will not be managed by udev or others, so we can make it
noexec and readonly after we have made all device nodes. As /dev/shm
needs to be writable, we can't use bind_remount_recursive().
2016-05-15 22:34:05 -04:00
topimiettinen 9e5f825280 namespace: unmount old /dev under our new private /dev (#3254)
Drop all dangling old /dev mounts before mounting a new private /dev tree.
2016-05-14 12:46:23 -04:00
Daniel Mack 9ca6ff50ab Remove kdbus custom endpoint support
This feature will not be used anytime soon, so remove a bit of cruft.

The BusPolicy= config directive will stay around as compat noop.
2016-02-11 22:12:04 +01:00
Daniel Mack b26fa1a2fb tree-wide: remove Emacs lines from all files
This should be handled fine now by .dir-locals.el, so no need to carry that
stuff in every file.
2016-02-10 13:41:57 +01:00
Lennart Poettering b5efdb8af4 util-lib: split out allocation calls into alloc-util.[ch] 2015-10-27 13:45:53 +01:00
Lennart Poettering ee104e11e3 user-util: move UID/GID related macros from macro.h to user-util.h 2015-10-27 13:25:57 +01:00
Lennart Poettering affb60b1ef util-lib: split out umask-related code to umask-util.h 2015-10-27 13:25:56 +01:00
Lennart Poettering 8b43440b7e util-lib: move string table stuff into its own string-table.[ch] 2015-10-27 13:25:56 +01:00
Lennart Poettering 4349cd7c1d util-lib: move mount related utility calls to mount-util.[ch] 2015-10-27 13:25:55 +01:00
Lennart Poettering 2583fbea8e socket-util: move remaining socket-related calls from util.[ch] to socket-util.[ch] 2015-10-26 01:24:39 +01:00
Lennart Poettering 3ffd4af220 util-lib: split out fd-related operations into fd-util.[ch]
There are more than enough to deserve their own .c file, hence move them
over.
2015-10-25 13:19:18 +01:00
Lennart Poettering 07630cea1f util-lib: split out string related calls from util.[ch] into its own file string-util.[ch]
There are more than enough calls doing string manipulations to deserve
its own files, hence do something about it.

This patch also sorts the #include blocks of all files that needed to be
updated, according to the sorting suggestions from CODING_STYLE. Since
pretty much every file needs our string manipulation functions this
effectively means that most files have sorted #include blocks now.

Also touches a few unrelated include files.
2015-10-24 23:05:02 +02:00
Lennart Poettering 3ee897d6c2 tree-wide: port more code to use send_one_fd() and receive_one_fd()
Also, make it slightly more powerful, by accepting a flags argument, and
make it safe for handling if more than one cmsg attribute happens to be
attached.
2015-09-29 21:08:37 +02:00
Lennart Poettering 1f6b411372 tree-wide: update empty-if coccinelle script to cover empty-while and more
Let's also clean up single-line while and for blocks.
2015-09-09 14:59:51 +02:00
Lennart Poettering 94c156cd45 tree-wide: make use of log_error_errno() return value in more cases
The previous coccinelle semantic patch that improved usage of
log_error_errno()'s return value, only looked for log_error_errno()
invocations with a single parameter after the error parameter. Update
the patch to handle arbitrary numbers of additional arguments.
2015-09-09 14:58:26 +02:00
Lennart Poettering 76ef789d26 tree-wide: make use of log_error_errno() return value
Turns this:

        r = -errno;
        log_error_errno(errno, "foo");

into this:

        r = log_error_errno(errno, "foo");

and this:

        r = log_error_errno(errno, "foo");
        return r;

into this:

        return log_error_errno(errno, "foo");
2015-09-09 08:20:20 +02:00
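The transformation in 76ef789d26 works because the logging call hands back the negated error it was given. The sketch below imitates that contract with a hypothetical sketch_log_error_errno(); it is a stand-in for illustration and does not reproduce systemd's logging macros.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Logs the message and returns the error as a negative value, whichever sign
 * it was passed in with. */
int sketch_log_error_errno(int error, const char *message) {
        assert(message);

        if (error < 0)
                error = -error;

        fprintf(stderr, "%s (errno=%d)\n", message, error);
        return -error;
}

/* Example caller: logging and returning collapse into a single statement. */
int sketch_check_readable(const char *path) {
        FILE *f;

        assert(path);

        f = fopen(path, "r");
        if (!f)
                return sketch_log_error_errno(errno, "Failed to open file");

        fclose(f);
        return 0;
}
```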
Lennart Poettering 2a1288ff89 util: introduce CMSG_FOREACH() macro and make use of it everywhere
It's only marginally shorter than the usual for() loop, but certainly
more readable.
2015-06-10 19:29:47 +02:00
Jason Pleau d38e01dc96 core/namespace: Protect /usr instead of /home with ProtectSystem=yes
A small typo in ee818b8 caused /home to be put in read-only instead of
/usr when ProtectSystem was enabled (ie: not set to "no").
2015-05-31 20:29:36 +02:00
Lennart Poettering 03cfe0d514 nspawn: finish user namespace support 2015-05-21 16:32:01 +02:00
Lennart Poettering 6458ec20b5 core,nspawn: unify code that moves the root dir 2015-05-20 14:38:12 +02:00
Alban Crequy ee818b89f4 core: Private*/Protect* options with RootDirectory
When a service is chrooted with the option RootDirectory=/opt/..., then
the options PrivateDevices, PrivateTmp, ProtectHome, ProtectSystem must
mount the directories under $RootDirectory/{dev,tmp,home,usr,boot}.

The test-ns tool can test setup_namespace() with and without chroot:
 $ sudo TEST_NS_PROJECTS=/home/lennart/projects ./test-ns
 $ sudo TEST_NS_CHROOT=/home/alban/debian-tree TEST_NS_PROJECTS=/home/alban/debian-tree/home/alban/Documents ./test-ns
2015-05-18 18:47:45 +02:00
Lennart Poettering 5a8af538ae nspawn: rework custom mount point order, and add support for overlayfs
Previously all bind mounts were applied in the order specified,
followed by all tmpfs mounts in the order specified. This is
problematic, if bind mounts shall be placed within tmpfs mounts.

This patch hence reworks the custom mount point logic, and always applies
them in strict prefix-first order. This means the order of mounts
specified on the command line becomes irrelevant, the right operation
will always be executed.

While we are at it this commit also adds native support for overlayfs
mounts, as supported by recent kernels.
2015-05-13 14:07:26 +02:00
Iago López Galeiras 4543768d13 nspawn: change filesystem type from "bind" to NULL in mount() syscalls
Try to keep syscalls as minimal as possible.
2015-03-31 15:36:53 +02:00
Michal Schmidt a0827e2b12 core/namespace: fix path sorting
The comparison function we use for qsorting paths is overly indifferent.
Consider these 3 paths for sorting:
 /foo
 /bar
 /foo/foo
qsort() may compare:
 "/foo" with "/bar" => 0, indifference
 "/bar" with "/foo/foo" => 0, indifference
and assume transitively that "/foo" and "/foo/foo" are also indifferent.

But this is wrong, we want "/foo" sorted before "/foo/foo".
The comparison function must be transitive.

Use path_compare(), which behaves properly.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1184016
2015-03-16 22:17:15 +01:00
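The requirement stated in a0827e2b12 is that the comparator defines a consistent order in which a directory sorts before its children. The sketch below is one simple way to get that and is not systemd's path_compare(): it compares byte by byte but ranks '/' below every other character. The function names are hypothetical and paths are assumed to be normalized.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Total order on paths: end of string first, then '/', then all other bytes.
 * This places "/foo" directly before "/foo/foo" and after "/bar". */
int sketch_path_compare(const char *a, const char *b) {
        assert(a);
        assert(b);

        for (;; a++, b++) {
                if (*a == *b) {
                        if (*a == '\0')
                                return 0;
                        continue;
                }

                if (*a == '\0')
                        return -1;
                if (*b == '\0')
                        return 1;
                if (*a == '/')
                        return -1;
                if (*b == '/')
                        return 1;

                return (unsigned char) *a < (unsigned char) *b ? -1 : 1;
        }
}

/* Adapter with the signature qsort() expects. */
static int sketch_compare_entries(const void *a, const void *b) {
        return sketch_path_compare(*(const char *const *) a, *(const char *const *) b);
}

void sketch_sort_paths(const char **paths, size_t n) {
        qsort(paths, n, sizeof *paths, sketch_compare_entries);
}
```

Since this is a lexicographic comparison over a fixed character ranking, it is transitive, so qsort() can no longer infer a wrong relation between "/foo" and "/foo/foo" by way of "/bar".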
Zbigniew Jędrzejewski-Szmek 42b1b9907d core: explicitly ignore failure during cleanup
CID #1237550.
2015-03-13 23:42:17 -04:00
Zbigniew Jędrzejewski-Szmek 3164e3cbc5 core: either ignore or handle mount failures
/dev/pts/ptmx is as important as /dev/pts, so error out if that
fails. Others seem less important, since the namespace is usable
without them, so ignore failures.

CID #123755, #123754.
2015-03-13 23:42:17 -04:00
Zbigniew Jędrzejewski-Szmek dc75168823 Use space after a silencing (void)
We were using a space more often than not, and this way is
codified in CODING_STYLE.
2015-03-13 23:42:17 -04:00
Thomas Hindoe Paaboel Andersen 2eec67acbb remove unused includes
This patch removes includes that are not used. The removals were found with
include-what-you-use which checks if any of the symbols from a header is
in use.
2015-02-23 23:53:42 +01:00
Lennart Poettering 63c372cb9d util: rework strappenda(), and rename it strjoina()
After all it is now much more like strjoin() than strappend(). At the
same time, add support for NULL sentinels, even if they are normally not
necessary.
2015-02-03 02:05:59 +01:00
Topi Miettinen e65476622d Type of mount(2) flags is unsigned long 2015-01-01 14:39:17 -05:00
Lennart Poettering d7b8eec7dc tmpfiles: add new line type 'v' for creating btrfs subvolumes 2014-12-28 02:08:40 +01:00
Michal Schmidt 4a62c710b6 treewide: another round of simplifications
Using the same scripts as in f647962d64 "treewide: yet more log_*_errno
+ return simplifications".
2014-11-28 19:57:32 +01:00
Michal Schmidt 56f64d9576 treewide: use log_*_errno whenever %m is in the format string
If the format string contains %m, clearly errno must have a meaningful
value, so we might as well use log_*_errno to have ERRNO= logged.

Using:
find . -name '*.[ch]' | xargs sed -r -i -e \
's/log_(debug|info|notice|warning|error|emergency)\((".*%m.*")/log_\1_errno(errno, \2/'

Plus some whitespace, linewrap, and indent adjustments.
2014-11-28 19:49:27 +01:00
Susant Sahani b77acbcf7d namespace: unchecked return value from library
fix:

CID 1237553 (#1 of 6): Unchecked return value from library
(CHECKED_RETURN)

CID 1237553 (#3 of 6): Unchecked return value from library
(CHECKED_RETURN)

CID 1237553 (#4 of 6): Unchecked return value from library
(CHECKED_RETURN)

CID 1237553 (#5 of 6): Unchecked return value from library
(CHECKED_RETURN)

CID 1237553 (#6 of 6): Unchecked return value from library
(CHECKED_RETURN)
2014-11-17 12:06:40 +01:00
Daniel Mack 63cc4c3138 sd-bus: sync with kdbus upstream (ABI break)
kdbus has seen a larger update than expected lately, most notably with
kdbusfs, a file system to expose the kdbus control files:

 * Each time a file system of this type is mounted, a new kdbus
   domain is created.

 * The layout inside each mount point is the same as before, except
   that domains are not hierarchically nested anymore.

 * Domains are therefore also unnamed now.

 * Unmounting a kdbusfs will automatically also destroy the
   associated domain.

 * Hence, the action of creating a kdbus domain is now as
   privileged as mounting a filesystem.

 * This way, we can get around creating dev nodes for everything,
   which is last but not least something that is not limited by
   20-bit minor numbers.

The kdbus specific bits in nspawn have all been dropped now, as nspawn
can rely on the container OS to set up its own kdbus domain, simply by
mounting a new instance.

A new set of mounts has been added to mount things *after* the kernel
modules have been loaded. For now, only kdbus is in this set, which is
invoked with mount_setup_late().
2014-11-13 20:41:52 +01:00
Lennart Poettering ecabcf8b6e selinux: clean up selinux label function naming 2014-10-23 21:36:56 +02:00
WaLyong Cho cc56fafeeb mac: rename apis with mac_{selinux/smack}_ prefix 2014-10-23 17:13:15 +02:00
Lennart Poettering a004cb4cb2 namespace: add missing 'const' to parameters 2014-10-17 13:49:08 +02:00
Zbigniew Jędrzejewski-Szmek d267c5aa3d core/namespace: remove invalid check
dir cannot be NULL here, because it was allocated with alloca.

CID #1237768.
2014-10-03 20:42:09 -04:00
Zbigniew Jędrzejewski-Szmek 1775f1ebc4 core/namespace: remove invalid check
root cannot be NULL here, because it was allocated with alloca.

CID #1237769.
2014-10-03 20:42:09 -04:00
Thomas Hindoe Paaboel Andersen 120d578e5f namespace: avoid possible use of uninitialized variable 2014-09-08 22:09:41 +02:00
Daniel Mack a610cc4f18 namespace: add support for custom kdbus endpoint
If a path to a previously created custom kdbus endpoint is passed in,
bind-mount a new devtmpfs that contains a 'bus' node, which in turn is
bind-mounted with the custom endpoint. This tmpfs is then mounted over the
kdbus subtree that refers to the current bus.

This way, we can fake the bus node in order to lock down services with
a kdbus custom endpoint policy.
2014-09-08 14:12:56 +02:00
Ansgar Burchardt e2d7c1a075 drop_duplicates: copy full BindMount struct
At least

  t->ignore = f->ignore;

is missing here. Just copy the full struct to be sure.
2014-07-27 15:15:11 -04:00
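The drop_duplicates fix above (e2d7c1a075) can be illustrated with a minimal sketch. The `BindMount` layout below is a hypothetical stand-in, since only the `ignore` member is named in the commit message, and `drop_duplicates_sketch` is an illustrative name rather than the function in the tree. It shows only that a whole-struct assignment cannot miss a member the way a field-by-field copy can.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the real struct; only 'ignore' is named in the commit. */
typedef struct BindMount {
        const char *path;
        int mode;
        bool done;
        bool ignore;
} BindMount;

/* Compact a path-sorted array in place, keeping the first entry for each path.
 * Returns the new number of entries. */
static size_t drop_duplicates_sketch(BindMount *m, size_t n) {
        BindMount *f, *t, *previous = NULL;

        for (f = m, t = m; f < m + n; f++) {
                if (previous && strcmp(f->path, previous->path) == 0)
                        continue;

                *t = *f;        /* whole-struct assignment: every member is copied */
                previous = t;
                t++;
        }

        return (size_t) (t - m);
}
```

Had the loop copied only `path` and `mode`, a surviving entry moved to an earlier slot would have kept the `ignore` value of whatever entry previously occupied that slot.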
Lennart Poettering 664064d60c namespace: make sure /tmp, /var/tmp and /dev are writable in namespaces we set up 2014-07-03 16:28:26 +02:00
Lennart Poettering 002b226843 namespace: fix uninitialized memory access 2014-07-03 16:28:26 +02:00
Lennart Poettering dd078a1ef8 namespace: properly label device nodes we create
https://bugzilla.redhat.com/show_bug.cgi?id=1081429
2014-06-18 00:09:46 +02:00
Lennart Poettering 051be1f71c namespace: cover /boot with ProtectSystem= again
Now that we properly exclude autofs mounts from ProtectSystem= we can
include it in the effect of ProtectSystem= again.
2014-06-06 14:48:51 +02:00
Lennart Poettering d6797c920e namespace: beef up read-only bind mount logic
Instead of blindly creating another bind mount for read-only mounts,
check if there's already one we can use, and if so, use it. Also,
recursively mark all submounts read-only too. Also, ignore autofs mounts
when remounting read-only unless they are already triggered.
2014-06-06 14:37:40 +02:00
Lennart Poettering c8835999c3 namespace: also include /root in ProtectHome=
/root can't really be autofs, and is also a home directory, so cover it
with ProtectHome=.
2014-06-05 21:55:06 +02:00
Lennart Poettering 6d313367d9 namespace: when setting up an inaccessible mount point, unmount everything below
This has the benefit of not triggering any autofs mount points
unnecessarily.
2014-06-05 21:35:35 +02:00
Lennart Poettering 5331194c12 core: don't include /boot in effect of ProtectSystem=
This would otherwise unconditionally trigger any /boot autofs mount,
which we probably should avoid.

ProtectSystem= will now only cover /usr and (optionally) /etc, both of
which cannot be autofs anyway.

ProtectHome= will continue to cover /run/user and /home. The former
cannot be autofs either. /home could be, but it is used frequently
enough (unlike /boot) that it isn't too problematic to simply trigger
it unconditionally via ProtectHome=.
2014-06-05 10:03:26 +02:00
Lennart Poettering 1b8689f949 core: rename ReadOnlySystem= to ProtectSystem= and add a third value for also mounting /etc read-only
Also, rename ProtectedHome= to ProtectHome=, to simplify things a bit.

With this in place we now have two neat options ProtectSystem= and
ProtectHome= for protecting the OS itself (and optionally its
configuration), and for protecting the user's data.
2014-06-04 18:12:55 +02:00
Lennart Poettering e06b6479a5 core: provide /dev/ptmx as symlink in PrivateDevices= execution environments 2014-06-04 17:21:18 +02:00
Lennart Poettering 82d252404a core: make sure PrivateDevices= makes /dev/log available
Now that we moved the actual syslog socket to
/run/systemd/journal/dev-log we can actually make /dev/log a symlink to
it, when PrivateDevices= is used, thus making syslog available to
services using PrivateDevices=.
2014-06-04 16:59:13 +02:00
Lennart Poettering 417116f234 core: add new ReadOnlySystem= and ProtectedHome= settings for service units
ReadOnlySystem= uses fs namespaces to mount /usr and /boot read-only for
a service.

ProtectedHome= uses fs namespaces to mount /home and /run/user
inaccessible or read-only for a service.

This patch also enables these settings for all our long-running services.

Together they should be a good building block for a minimal service
sandbox, removing the ability for services to modify the operating
system or access the user's private data.
2014-06-03 23:57:51 +02:00
Lennart Poettering c2c13f2df4 unit: turn off mount propagation for udevd
Keep mounts done by udev rules private to udevd. Also, document how
MountFlags= may be used for this.
2014-03-20 04:16:39 +01:00
Lennart Poettering 2b85f4e19c core: Beef up PrivateDevices=
Also mount /dev/kdbus, /dev/mqueue and /dev/hugepages into the /dev for
namespaced services.
2014-03-19 16:25:11 +01:00
Lennart Poettering 94828d2ddc conf-parser: config_parse_path_strv() is not generic, so let's move it into load-fragment.c
The parse code actually checked for specific lvalue names, which is
really wrong for supposedly generic parsers...
2014-03-03 21:40:55 +01:00
Lennart Poettering 7f112f50fe exec: introduce PrivateDevices= switch to provide services with a private /dev
Similar to PrivateNetwork= and PrivateTmp=, introduce PrivateDevices=, which
sets up a private /dev with only the API pseudo-devices like /dev/null,
/dev/zero, /dev/random, but not any physical devices in it.
2014-01-20 21:28:37 +01:00
Lennart Poettering 6b46ea73e3 namespace: include boot id in private tmp directories
This way it is easy to only exclude directories from the current boot
from automatic clean up in /var/tmp.

Also, pick a longer name for the directories so that our globs in
tmp.conf can be simpler yet equally accurate.
2013-12-13 04:06:43 +01:00
Lennart Poettering 76cd584b8d namespace: comment typo fix 2013-11-27 20:31:51 +01:00
Lennart Poettering 613b411c94 service: add the ability for units to join other units' PrivateNetwork= and PrivateTmp= namespaces 2013-11-27 20:28:48 +01:00
Zbigniew Jędrzejewski-Szmek d8c9d3a468 systemd: use unit name in PrivateTmp directories
Unit name is used whole in the directory name, so that the unit name
can be easily extracted from it, e.g. "/tmp/systemd-abcd.service-DEDBIF1".

https://bugzilla.redhat.com/show_bug.cgi?id=957439
2013-10-22 22:54:09 -04:00
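As a sketch of why keeping the unit name whole is convenient (d8c9d3a468 above), the name can be recovered with plain string handling. The helper name `private_tmp_unit_name` is hypothetical. The code assumes exactly the layout shown in the commit message ("systemd-" + unit name + "-" + random suffix) and assumes the random suffix itself contains no '-'.

```c
#include <stddef.h>
#include <string.h>

/* Copy the unit name out of a path such as "/tmp/systemd-abcd.service-DEDBIF1".
 * Returns 0 on success, -1 if the path does not match the assumed layout
 * or the buffer is too small. */
static int private_tmp_unit_name(const char *path, char *buf, size_t size) {
        static const char prefix[] = "systemd-";
        const char *base, *dash;
        size_t len;

        /* Look only at the last path component. */
        base = strrchr(path, '/');
        base = base ? base + 1 : path;

        if (strncmp(base, prefix, sizeof(prefix) - 1) != 0)
                return -1;
        base += sizeof(prefix) - 1;

        /* Unit names may themselves contain '-', so split at the last one. */
        dash = strrchr(base, '-');
        if (!dash || dash == base)
                return -1;

        len = (size_t) (dash - base);
        if (len >= size)
                return -1;

        memcpy(buf, base, len);
        buf[len] = '\0';
        return 0;
}
```

Called on the example path from the commit message, this fills the buffer with "abcd.service". The naming scheme was changed again by a later commit in this log (6b46ea73e3), so the sketch applies only to the layout described here.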
Zbigniew Jędrzejewski-Szmek 7ff7394d9e Never call qsort on potentially NULL arrays
This extends 62678ded 'efi: never call qsort on potentially
NULL arrays' to all other places where qsort is used and it
is not obvious that the count is non-zero.
2013-10-13 17:56:54 -04:00
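The hazard addressed by 7ff7394d9e above is that the C standard requires pointer arguments to be valid even when the element count is zero, so calling qsort() with a NULL base is not guaranteed to be harmless. A minimal guard might look like the following; `qsort_guarded` and `compare_int` are hypothetical names used for illustration, not necessarily what the commit introduced.

```c
#include <stdlib.h>

/* Hypothetical wrapper: skip qsort() entirely when there is nothing to sort,
 * so a NULL base pointer is never handed to the library. */
static void qsort_guarded(void *base, size_t nmemb, size_t size,
                          int (*compar)(const void *, const void *)) {
        if (nmemb <= 1)
                return;

        qsort(base, nmemb, size, compar);
}

/* Three-way integer comparison that cannot overflow. */
static int compare_int(const void *a, const void *b) {
        int x = *(const int *) a, y = *(const int *) b;

        return (x > y) - (x < y);
}
```

With this guard, a caller holding an empty, never-allocated array can sort it unconditionally, e.g. `qsort_guarded(NULL, 0, sizeof(int), compare_int)`.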
Maciej Wereski ea92ae33e0 "-" prefix for InaccessibleDirectories and ReadOnlyDirectories 2013-08-23 12:48:14 -04:00
Zbigniew Jędrzejewski-Szmek d5a3f0eac7 core: remove unnecessary goto in setup_namespace 2013-03-20 19:16:01 -04:00
Zbigniew Jędrzejewski-Szmek d34cd37490 Make PrivateTmp dirs also inaccessible from the outside
Currently, PrivateTmp=yes means that the service cannot see the /tmp
shared by rest of the system and is isolated from other services using
PrivateTmp, but users can access and modify /tmp as seen by the
service.

Move the private /tmp and /var/tmp directories into a 0077-mode
directory. This way unprivileged users on the system cannot see (or
modify) /tmp as seen by the service.
2013-03-20 14:08:41 -04:00
Michal Sekletar c17ec25e4d core: reuse the same /tmp, /var/tmp and inaccessible dir
All Execs within the service get the same /tmp and /var/tmp
directories mounted if the service is configured with
PrivateTmp=yes. Temporary directories are cleaned up by the service
itself in addition to systemd-tmpfiles. The directory that is mounted
as inaccessible is created at runtime in /run/systemd.
2013-03-15 22:56:40 -04:00
Lennart Poettering 1e41be2015 nspawn,namespaces: make sure we recursively bind mount things in
We want to make sure that everything from the host is also visible in
the sandbox.
2012-08-13 16:25:03 +02:00
Lennart Poettering ac0930c892 namespace: rework namespace support
- don't use pivot_root() anymore, just reuse root hierarchy
- first create all mounts, then mark them read-only so that we get the
  right behaviour when people want writable mounts inside of
  read-only mounts
- don't pass invalid combinations of MS_ constants to the kernel
2012-08-13 15:27:04 +02:00
Lennart Poettering 64825d3c58 fix a couple of issues found with llvm-analyze 2012-08-08 23:54:21 +02:00
Lennart Poettering c1d70f7ca5 namespace: make PrivateTmp= apply to both /tmp and /var/tmp 2012-05-14 22:41:30 +02:00
Kay Sievers 9eb977db5b util: split-out path-util.[ch] 2012-05-08 02:33:10 +02:00
Kay Sievers 4d46fec56d remove MS_* which can not be combined with current kernel code
MS_BIND|MS_MOVE can not be combined:
  do_mount()
    else if (flags & MS_BIND)
      do_loopback(&path, dev_name, flags & MS_REC);
    [...]
    else if (flags & MS_MOVE)
      do_move_mount(&path, dev_name);

MS_REMOUNT|MS_UNBINDABLE can not be combined:
  do_mount()
    if (flags & MS_REMOUNT)
      do_remount(&path, flags & ~MS_REMOUNT, mnt_flags, data_page);
    [...]
    else if (flags & (MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE))
      do_change_type(&path, flags);
2012-04-18 13:37:45 +02:00
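The two conflicts quoted in 4d46fec56d above come from do_mount() dispatching through an else-if chain, so only the first matching operation runs. A caller could check for them before invoking mount(2), roughly as sketched below. `mount_flags_conflict` is a hypothetical helper that models only the two combinations named in the commit message, not the kernel's full dispatch logic.

```c
#include <stdbool.h>
#include <sys/mount.h>

/* Return true if 'flags' contains one of the two combinations the commit
 * message identifies as impossible to act on in a single mount() call. */
static bool mount_flags_conflict(unsigned long flags) {
        const unsigned long propagation =
                MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE;

        /* The MS_BIND branch is taken first, so MS_MOVE is never reached. */
        if ((flags & MS_BIND) && (flags & MS_MOVE))
                return true;

        /* The MS_REMOUNT branch is taken first, so the propagation change
         * is never reached. */
        if ((flags & MS_REMOUNT) && (flags & propagation))
                return true;

        return false;
}
```

Combinations the chain does not separate, such as MS_BIND|MS_REC or MS_BIND|MS_REMOUNT|MS_RDONLY, are deliberately not flagged by this sketch.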
Lennart Poettering 5430f7f2bc relicense to LGPLv2.1 (with exceptions)
We finally got the OK from all contributors with non-trivial commits to
relicense systemd from GPL2+ to LGPL2.1+.

Some udev bits continue to be GPL2+ for now, but we are looking into
relicensing them too, to allow free copy/paste of all code within
systemd.

The bits that used to be MIT continue to be MIT.

The big benefit of the relicensing is that closed source code may now
link against libsystemd-login.so and friends.
2012-04-12 00:24:39 +02:00
Kay Sievers b30e2f4c18 move libsystemd_core.la sources into core/ 2012-04-11 16:03:51 +02:00
Renamed from src/namespace.c