diff --git a/NEWS b/NEWS index bcca3a94d9..60ef049c2e 100644 --- a/NEWS +++ b/NEWS @@ -285,6 +285,15 @@ CHANGES WITH 239 in spe: query the default, built-in $PATH PID 1 will pass to the services it manages. + * A new unit file setting PrivateMounts= has been added. It's a boolean + option. If enabled the unit's processes are invoked in their own file + system namespace. Note that this behaviour is also implied if any + other file system namespacing options (such as PrivateTmp=, + PrivateDevices=, ProtectSystem=, …) are used. This option is hence + primarily useful for services that do not use any of the other file + system namespacing options. One such service is systemd-udevd.service + where this is now used by default. + Contributions from: Adam Duskett, Alan Jenkins, Alessandro Casale, Alexander Kurtz, Alex Gartrell, Anssi Hannula, Antique, Arnaud Rebillout, Brian J. Murrell, Bruno Vernay, Chris Lesiak, Christian diff --git a/man/systemd.exec.xml b/man/systemd.exec.xml index fc24f97103..21bc26ac09 100644 --- a/man/systemd.exec.xml +++ b/man/systemd.exec.xml @@ -1277,28 +1277,69 @@ RestrictNamespaces=~cgroup net stopped. This setting is implied if DynamicUser= is set. + + PrivateMounts= + + Takes a boolean parameter. If set, the processes of this unit will be run in their own private + file system (mount) namespace with all mount propagation from the processes towards the host's main file system + namespace turned off. This means any file system mount points established or removed by the unit's processes + will be private to them and not be visible to the host. However, file system mount points established or + removed on the host will be propagated to the unit's processes. See mount_namespaces7 for + details on file system namespaces. Defaults to off. 
+ + When turned on, this executes three operations for each invoked process: a new + CLONE_NEWNS namespace is created, after which all existing mounts are remounted to + MS_SLAVE to disable propagation from the unit's processes to the host (but leaving + propagation in the opposite direction in effect). Finally, the mounts are remounted again to the propagation + mode configured with MountFlags=, see below. + + File system namespaces are set up individually for each process forked off by the service manager. Mounts + established in the namespace of the process created by ExecStartPre= will hence be cleaned + up automatically as soon as that process exits and will not be available to subsequent processes forked off for + ExecStart= (and similar applies to the various other commands configured for + units). Similarly, JoinsNamespaceOf= does not permit sharing kernel mount namespaces between + units, it only enables sharing of the /tmp/ and /var/tmp/ + directories. + + Other file system namespace unit settings — PrivateMounts=, + PrivateTmp=, PrivateDevices=, ProtectSystem=, + ProtectHome=, ReadOnlyPaths=, InaccessiblePaths=, + ReadWritePaths=, … — also enable file system namespacing in a fashion equivalent to this + option. Hence it is primarily useful to explicitly request this behaviour if none of the other settings are + used. + + MountFlags= - Takes a mount propagation flag: , or - , which control whether mounts in the file system namespace set up for this unit's - processes will receive or propagate mounts and unmounts. See mount2 for - details. Defaults to . Use to ensure that mounts and unmounts - are propagated from systemd's namespace to the service's namespace and vice versa. Use - to run processes so that none of their mounts and unmounts will propagate to the host. Use - to also ensure that no mounts and unmounts from the host will propagate into the unit - processes' namespace. 
If this is set to or , any mounts created - by spawned processes will be unmounted after the completion of the current command line of - ExecStartPre=, ExecStartPost=, ExecStart=, and - ExecStopPost=. Note that means that file systems mounted on the host - might stay mounted continuously in the unit's namespace, and thus keep the device busy. Note that the file - system namespace related options (PrivateTmp=, PrivateDevices=, - ProtectSystem=, ProtectHome=, ProtectKernelTunables=, - ProtectControlGroups=, ReadOnlyPaths=, - InaccessiblePaths=, ReadWritePaths=) require that mount and unmount - propagation from the unit's file system namespace is disabled, and hence downgrade to - . + Takes a mount propagation setting: , or + , which controls whether file system mount points in the file system namespaces set up + for this unit's processes will receive or propagate mounts and unmounts from other file system namespaces. See + mount2 + for details on mount propagation, and the three propagation flags in particular. + + This setting only controls the final propagation setting in effect on all mount + points of the file system namespace created for each process of this unit. Other file system namespacing unit + settings (see the discussion in PrivateMounts= above) will implicitly disable mount and + unmount propagation from the unit's processes towards the host by changing the propagation setting of all mount + points in the unit's file system namespace to first. Setting this option to + does not reestablish propagation in that case. Conversely, if this option is set, but + no other file system namespace setting is used, then new file system namespaces will be created for the unit's + processes and this propagation flag will be applied right away to all mounts within it, without the + intermediary application of . 
+ + If not set – but file system namespaces are enabled through another file system namespace unit setting – + mount propagation is used, but — as mentioned — as is applied + first, propagation from the unit's processes to the host is still turned off. + + It is not recommended to use mount propagation for units, as this means + temporary mounts (such as removable media) of the host will stay mounted and thus indefinitely busy in forked + off processes, as unmount propagation events won't be received by the file system namespace of the unit. + + Usually, it is best to leave this setting unmodified, and use higher level file system namespacing + options instead, in particular PrivateMounts=, see above. + diff --git a/src/core/dbus-execute.c b/src/core/dbus-execute.c index 8c752ceaa6..747b9d8eeb 100644 --- a/src/core/dbus-execute.c +++ b/src/core/dbus-execute.c @@ -744,6 +744,7 @@ const sd_bus_vtable bus_exec_vtable[] = { SD_BUS_PROPERTY("ProtectControlGroups", "b", bus_property_get_bool, offsetof(ExecContext, protect_control_groups), SD_BUS_VTABLE_PROPERTY_CONST), SD_BUS_PROPERTY("PrivateNetwork", "b", bus_property_get_bool, offsetof(ExecContext, private_network), SD_BUS_VTABLE_PROPERTY_CONST), SD_BUS_PROPERTY("PrivateUsers", "b", bus_property_get_bool, offsetof(ExecContext, private_users), SD_BUS_VTABLE_PROPERTY_CONST), + SD_BUS_PROPERTY("PrivateMounts", "b", bus_property_get_bool, offsetof(ExecContext, private_mounts), SD_BUS_VTABLE_PROPERTY_CONST), SD_BUS_PROPERTY("ProtectHome", "s", property_get_protect_home, offsetof(ExecContext, protect_home), SD_BUS_VTABLE_PROPERTY_CONST), SD_BUS_PROPERTY("ProtectSystem", "s", property_get_protect_system, offsetof(ExecContext, protect_system), SD_BUS_VTABLE_PROPERTY_CONST), SD_BUS_PROPERTY("SameProcessGroup", "b", bus_property_get_bool, offsetof(ExecContext, same_pgrp), SD_BUS_VTABLE_PROPERTY_CONST), @@ -1110,6 +1111,9 @@ int bus_exec_context_set_transient_property( if (streq(name, "PrivateDevices")) return 
bus_set_transient_bool(u, name, &c->private_devices, message, flags, error); + if (streq(name, "PrivateMounts")) + return bus_set_transient_bool(u, name, &c->private_mounts, message, flags, error); + if (streq(name, "PrivateNetwork")) return bus_set_transient_bool(u, name, &c->private_network, message, flags, error); diff --git a/src/core/execute.c b/src/core/execute.c index 2c64e08176..6aa4ec9c78 100644 --- a/src/core/execute.c +++ b/src/core/execute.c @@ -1780,6 +1780,7 @@ static bool exec_needs_mount_namespace( return true; if (context->private_devices || + context->private_mounts || context->protect_system != PROTECT_SYSTEM_NO || context->protect_home != PROTECT_HOME_NO || context->protect_kernel_tunables || @@ -2312,7 +2313,7 @@ static int apply_mount_namespace( _cleanup_strv_free_ char **empty_directories = NULL; char *tmp = NULL, *var = NULL; const char *root_dir = NULL, *root_image = NULL; - NamespaceInfo ns_info = {}; + NamespaceInfo ns_info; bool needs_sandboxing; BindMount *bind_mounts = NULL; size_t n_bind_mounts = 0; @@ -2342,16 +2343,7 @@ static int apply_mount_namespace( if (r < 0) return r; - /* - * If DynamicUser=no and RootDirectory= is set then lets pass a relaxed - * sandbox info, otherwise enforce it, don't ignore protected paths and - * fail if we are enable to apply the sandbox inside the mount namespace. 
- */ - if (!context->dynamic_user && root_dir) - ns_info.ignore_protect_paths = true; - needs_sandboxing = (params->flags & EXEC_APPLY_SANDBOXING) && !(command->flags & EXEC_COMMAND_FULLY_PRIVILEGED); - if (needs_sandboxing) ns_info = (NamespaceInfo) { .ignore_protect_paths = false, @@ -2360,7 +2352,19 @@ static int apply_mount_namespace( .protect_kernel_tunables = context->protect_kernel_tunables, .protect_kernel_modules = context->protect_kernel_modules, .mount_apivfs = context->mount_apivfs, + .private_mounts = context->private_mounts, }; + else if (!context->dynamic_user && root_dir) + /* + * If DynamicUser=no and RootDirectory= is set then lets pass a relaxed + * sandbox info, otherwise enforce it, don't ignore protected paths and + * fail if we are unable to apply the sandbox inside the mount namespace. + */ + ns_info = (NamespaceInfo) { + .ignore_protect_paths = true, + }; + else + ns_info = (NamespaceInfo) {}; r = setup_namespace(root_dir, root_image, &ns_info, context->read_write_paths, diff --git a/src/core/execute.h b/src/core/execute.h index 9ca68c9fe3..cba079c413 100644 --- a/src/core/execute.h +++ b/src/core/execute.h @@ -228,6 +228,7 @@ struct ExecContext { bool private_network; bool private_devices; bool private_users; + bool private_mounts; ProtectSystem protect_system; ProtectHome protect_home; bool protect_kernel_tunables; diff --git a/src/core/load-fragment-gperf.gperf.m4 b/src/core/load-fragment-gperf.gperf.m4 index 44c9978c54..15fb47838c 100644 --- a/src/core/load-fragment-gperf.gperf.m4 +++ b/src/core/load-fragment-gperf.gperf.m4 @@ -114,6 +114,7 @@ $1.ProtectKernelModules, config_parse_bool, 0, $1.ProtectControlGroups, config_parse_bool, 0, offsetof($1, exec_context.protect_control_groups) $1.PrivateNetwork, config_parse_bool, 0, offsetof($1, exec_context.private_network) $1.PrivateUsers, config_parse_bool, 0, offsetof($1, exec_context.private_users) +$1.PrivateMounts, config_parse_bool, 0, offsetof($1, exec_context.private_mounts) 
$1.ProtectSystem, config_parse_protect_system, 0, offsetof($1, exec_context.protect_system) $1.ProtectHome, config_parse_protect_home, 0, offsetof($1, exec_context.protect_home) $1.MountFlags, config_parse_exec_mount_flags, 0, offsetof($1, exec_context.mount_flags) diff --git a/src/core/namespace.c b/src/core/namespace.c index 24da3b8a64..2523c2a47f 100644 --- a/src/core/namespace.c +++ b/src/core/namespace.c @@ -1133,9 +1133,9 @@ int setup_namespace( _cleanup_free_ void *root_hash = NULL; MountEntry *m, *mounts = NULL; size_t root_hash_size = 0; - bool make_slave = false; const char *root; size_t n_mounts; + bool make_slave; bool require_prefix = false; int r = 0; @@ -1200,8 +1200,7 @@ int setup_namespace( protect_home, protect_system); /* Set mount slave mode */ - if (root || n_mounts > 0) - make_slave = true; + make_slave = root || n_mounts > 0 || ns_info->private_mounts; if (n_mounts > 0) { m = mounts = (MountEntry *) alloca0(n_mounts * sizeof(MountEntry)); diff --git a/src/core/namespace.h b/src/core/namespace.h index 705eb4e13a..e0e8e09e0f 100644 --- a/src/core/namespace.h +++ b/src/core/namespace.h @@ -50,6 +50,7 @@ typedef enum ProtectSystem { struct NamespaceInfo { bool ignore_protect_paths:1; bool private_dev:1; + bool private_mounts:1; bool protect_control_groups:1; bool protect_kernel_tunables:1; bool protect_kernel_modules:1; diff --git a/src/shared/bus-unit-util.c b/src/shared/bus-unit-util.c index 64b7ac8d69..01d820349a 100644 --- a/src/shared/bus-unit-util.c +++ b/src/shared/bus-unit-util.c @@ -699,7 +699,7 @@ static int bus_append_execute_property(sd_bus_message *m, const char *field, con if (STR_IN_SET(field, "IgnoreSIGPIPE", "TTYVHangup", "TTYReset", "TTYVTDisallocate", "PrivateTmp", "PrivateDevices", "PrivateNetwork", "PrivateUsers", - "NoNewPrivileges", "SyslogLevelPrefix", + "PrivateMounts", "NoNewPrivileges", "SyslogLevelPrefix", "MemoryDenyWriteExecute", "RestrictRealtime", "DynamicUser", "RemoveIPC", "ProtectKernelTunables", 
"ProtectKernelModules", "ProtectControlGroups", "MountAPIVFS", "CPUSchedulingResetOnFork", "LockPersonality")) diff --git a/units/systemd-udevd.service.in b/units/systemd-udevd.service.in index 8557522e7b..2b9fa69d9b 100644 --- a/units/systemd-udevd.service.in +++ b/units/systemd-udevd.service.in @@ -25,7 +25,7 @@ ExecStart=@rootlibexecdir@/systemd-udevd KillMode=mixed WatchdogSec=3min TasksMax=infinity -MountFlags=slave +PrivateMounts=yes MemoryDenyWriteExecute=yes RestrictRealtime=yes RestrictAddressFamilies=AF_UNIX AF_NETLINK AF_INET AF_INET6