Systemd

Author	SHA1	Message	Date
Greg Depoire--Ferrer	6597686865	seccomp: don't install filters for archs that can't use syscalls When seccomp_restrict_archs is called, architectures that are blocked are replaced by the SECCOMP_LOCAL_ARCH_BLOCKED marker so that they are not disabled again and filters are not installed for them. This can make some services that use SystemCallArchitecture= and SystemCallFilter= start faster.	2020-12-10 16:13:02 +01:00
Yu Watanabe	db9ecf0501	license: LGPL-2.1+ -> LGPL-2.1-or-later	2020-11-09 13:23:58 +09:00
Topi Miettinen	005bfaf118	exec: Add kill action to system call filters Define explicit action "kill" for SystemCallErrorNumber=. In addition to errno code, allow specifying "kill" as action for SystemCallFilter=. --- v7: seccomp_parse_errno_or_action() returns -EINVAL if !HAVE_SECCOMP v6: use streq_ptr(), let errno_to_name() handle bad values, kill processes, init syscall_errno v5: actually use seccomp_errno_or_action_to_string(), don't fail bus unit parsing without seccomp v4: fix build without seccomp v3: drop log action v2: action -> number	2020-09-15 12:54:17 +03:00
Zbigniew Jędrzejewski-Szmek	000c05207d	shared/seccomp-util: added functionality to make list of filtered syscalls While at it, start removing the "seccomp_" prefix from our own functions. It is used by libseccomp.	2020-08-24 20:05:09 +02:00
Zbigniew Jędrzejewski-Szmek	95aac01259	shared: add @known syscall list	2020-08-24 20:04:17 +02:00
Lennart Poettering	6b000af4f2	tree-wide: avoid some loaded terms https://tools.ietf.org/html/draft-knodel-terminology-02 https://lwn.net/Articles/823224/ This gets rid of most but not all occasions of these loaded terms: 1. scsi_id and friends are something that is supposed to be removed from our tree (see #7594) 2. The test suite defines an API used by the ubuntu CI. We can remove this too later, but this needs to be done in sync with the ubuntu CI. 3. In some cases the terms are part of APIs we call or where we expose concepts the kernel names the way it names them. (In particular all remaining uses of the word "slave" in our codebase are like this, it's used by the POSIX PTY layer, by the network subsystem, the mount API and the block device subsystem). Getting rid of the term in these contexts would mean doing some major fixes of the kernel ABI first. Regarding the replacements: when whitelist/blacklist is used as noun we replace it with allow list/deny list, and when used as verb with allow-list/deny-list.	2020-06-25 09:00:19 +02:00
Zbigniew Jędrzejewski-Szmek	de7fef4b6e	tree-wide: use set_ensure_put() Patch contains a coccinelle script, but it only works in some cases. Many parts were converted by hand. Note: I did not fix errors in return value handling. This will be done separately to keep the patch comprehensible. No functional change is intended in this patch.	2020-06-22 16:32:37 +02:00
Kevin Kuehler	620dbdd248	shared: Add ProtectKernelLogs property Add seccomp_protect_syslog, which adds a filter rule for the syslog system call.	2019-11-11 12:11:56 -08:00
Zbigniew Jędrzejewski-Szmek	9493b16871	Add @pkey syscall group Inspired by https://bugzilla.redhat.com/show_bug.cgi?id=1769299. This change doesn't solve the issue, but makes it easier to whitelist the syscall group.	2019-11-08 14:41:22 +01:00
Lennart Poettering	915fb32438	seccomp: add scmp_act_kill_process() helper that returns SCMP_ACT_KILL_PROCESS if supported	2019-05-24 10:48:28 +02:00
Anita Zhang	7bc5e0b12b	seccomp: check more error codes from seccomp_load() We noticed in our tests that occasionally SystemCallFilter= would fail to set and the service would run with no syscall filtering. Most of the time the same tests would apply the filter and fail the service as expected. While it's not totally clear why this happens, we noticed seccomp_load() in the systemd code base would fail open for all errors except EPERM and EACCES. ENOMEM, EINVAL, and EFAULT seem like reasonable values to add to the error set based on what I gather from libseccomp code and man pages: -ENOMEM: out of memory, failed to allocate space for a libseccomp structure, or would exceed a defined constant -EINVAL: kernel isn't configured to support the operations, args are invalid (to seccomp_load(), seccomp(), or prctl()) -EFAULT: addresses passed as args are invalid	2019-04-12 10:23:07 +02:00
Zbigniew Jędrzejewski-Szmek	58f6ab4454	pid1: pass unit name to seccomp parser when we have no file location Building on previous commit, let's pass the unit name when parsing dbus message or builtin whitelist, which is better than nothing. seccomp_parse_syscall_filter() is not needed anymore, so it is removed, and seccomp_parse_syscall_filter_full() is renamed to take its place.	2019-04-03 09:17:42 +02:00
Lennart Poettering	3c27973b13	seccomp: introduce seccomp_restrict_suid_sgid() for blocking chmod() for suid/sgid files	2019-04-02 16:56:48 +02:00
Topi Miettinen	aecd5ac621	core: ProtectHostname= feature Let services use a private UTS namespace. In addition, a seccomp filter is installed on set{host,domain}name and ro bind mounts on /proc/sys/kernel/{host,domain}name.	2019-02-20 10:50:44 +02:00
Zbigniew Jędrzejewski-Szmek	b54f36c604	seccomp: reduce logging about failure to add syscall to seccomp Our logs are full of: Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call oldstat() / -10037, ignoring: Numerical argument out of domain Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call get_thread_area() / -10076, ignoring: Numerical argument out of domain Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call set_thread_area() / -10079, ignoring: Numerical argument out of domain Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call oldfstat() / -10034, ignoring: Numerical argument out of domain Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call oldolduname() / -10036, ignoring: Numerical argument out of domain Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call oldlstat() / -10035, ignoring: Numerical argument out of domain Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call waitpid() / -10073, ignoring: Numerical argument out of domain ... This is pointless and makes debug logs hard to read. Let's keep the logs in test code, but disable it in nspawn and pid1. This is done through a function parameter because those functions operate recursively and it's not possible to make the caller log meaningfully. There should be no functional change, except the skipped debug logs.	2018-09-24 17:21:09 +02:00
Lennart Poettering	705268414f	seccomp: add new system call filter, suitable as default whitelist for system services Currently we employ mostly system call blacklisting for our system services. Let's add a new system call filter group @system-service that helps turning this around into a whitelist by default. The new group is very similar to nspawn's default filter list, but in some ways more restricted (as sethostname() and suchlike shouldn't be available to most system services just like that) and in others more relaxed (for example @keyring is blocked in nspawn since it's not properly virtualized yet in the kernel, but is fine for regular system services).	2018-06-14 17:44:20 +02:00
Lennart Poettering	0c69794138	tree-wide: remove Lennart's copyright lines These lines are generally out-of-date, incomplete and unnecessary. With SPDX and git repository much more accurate and fine grained information about licensing and authorship is available, hence let's drop the per-file copyright notice. Of course, removing copyright lines of others is problematic, hence this commit only removes my own lines and leaves all others untouched. It might be nicer if sooner or later those could go away too, making git the only and accurate source of authorship information.	2018-06-14 10:20:20 +02:00
Lennart Poettering	818bf54632	tree-wide: drop 'This file is part of systemd' blurb This part of the copyright blurb stems from the GPL use recommendations: https://www.gnu.org/licenses/gpl-howto.en.html The concept appears to originate in times where version control was per file, instead of per tree, and was a way to glue the files together. Ultimately, we nowadays don't live in that world anymore, and this information is entirely useless anyway, as people are very welcome to copy these files into any projects they like, and they shouldn't have to change bits that are part of our copyright header for that. Hence, let's just get rid of this old cruft, and shorten our codebase a bit.	2018-06-14 10:20:20 +02:00
Lennart Poettering	ef31828d06	tree-wide: unify how we define bit mask enums Let's always write "1 << 0", "1 << 1" and so on, except where we need more than 31 flag bits, where we write "UINT64(1) << 0", and so on to force 64bit values.	2018-06-12 21:44:00 +02:00
Zbigniew Jędrzejewski-Szmek	11a1589223	tree-wide: drop license boilerplate Files which are installed as-is (any .service and other unit files, .conf files, .policy files, etc), are left as is. My assumption is that SPDX identifiers are not yet that well known, so it's better to retain the extended header to avoid any doubt. I also kept any copyright lines. We can probably remove them, but it'd be nice to obtain explicit acks from all involved authors before doing that.	2018-04-06 18:58:55 +02:00
Lennart Poettering	13d92c6300	seccomp: rework functions for parsing system call filters This reworks system call filter parsing, and replaces a couple of "bool" function arguments by a single flags parameter. This shouldn't change behaviour, except for one case: when we recursively call our parsing function on our own syscall list, then we'll lower the log level to LOG_DEBUG from LOG_WARNING, because at that point things are just a problem in our own code rather than in the user configuration we are parsing, and we shouldn't hence generate confusing warnings about syntax errors. Fixes: #8261	2018-02-27 19:59:09 +01:00
Yu Watanabe	898748d8b9	core,seccomp: fix logic to parse syscall filter in dbus-execute.c If multiple SystemCallFilter= settings, some of them whitelists and the others blacklists, are sent to the bus, then the parse result was corrupted. This fixes the parse logic, which is now the same as the one used in load-fragment.c	2017-12-23 18:45:32 +09:00
Zbigniew Jędrzejewski-Szmek	53e1b68390	Add SPDX license identifiers to source files under the LGPL This follows what the kernel is doing, c.f. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5fd54ace4721fc5ce2bb5aef6318fcf17f421460.	2017-11-19 19:08:15 +01:00
Yu Watanabe	8cfa775f4f	core: add support to specify errno in SystemCallFilter= This makes each system call in a SystemCallFilter= blacklist optionally take an errno name or number after a colon. The errno takes precedence over the one given by SystemCallErrorNumber=. C.f. #7173. Closes #7169.	2017-11-11 21:54:12 +09:00
Lennart Poettering	44898c5358	seccomp: add three more seccomp groups @aio → asynchronous IO calls @sync → msync/fsync/... and friends @chown → changing file ownership (Also, change @privileged to reference @chown now, instead of the individual syscalls it contains)	2017-10-05 15:42:48 +02:00
Djalal Harouni	09d3020b0a	seccomp: remove '@credentials' syscall set (#6958) This removes the '@credentials' syscall set that was added in commit v234-468-gcd0ddf6f75. Most of these syscalls are so simple that we do not want to filter them. They work on the current calling process, doing only read operations, they do not have a deep kernel path. The problem may only be in 'capget' syscall since it can query arbitrary processes, and used to discover processes, however sending signal 0 to arbitrary processes can be used to discover if a process exists or not. It is unfortunate that Linux allows to query processes of different users. Let's put it now in '@process' syscall set, and later we may add it to a new '@basic-process' set that allows most basic process operations.	2017-10-03 07:20:05 +02:00
Lennart Poettering	cd0ddf6f75	seccomp: add four new syscall groups These groups should be useful shortcuts for sets of closely related syscalls where it usually makes more sense to allow them altogether or not at all.	2017-09-14 15:45:21 +02:00
Lennart Poettering	960e4569e1	nspawn: implement configurable syscall whitelisting/blacklisting Now that we have ported nspawn's seccomp code to the generic code in seccomp-util, let's extend it to support whitelisting and blacklisting of specific additional syscalls. This uses similar syntax as PID1's support for system call filtering, but in contrast to that always implements a blacklist (and not a whitelist), as we prepopulate the filter with a blacklist, and the unit's system call filter logic does not come with anything prepopulated. (Later on we might actually want to invert the logic here, and whitelist rather than blacklist things, but at this point let's not do that. In case we switch this over later, the syscall add/remove logic of this commit should be compatible conceptually.) Fixes: #5163 Replaces: #5944	2017-09-12 14:06:21 +02:00
Lennart Poettering	69b1b241bb	seccomp: split out inner loop code of seccomp_add_syscall_filter_set() Let's add a new helper function seccomp_add_syscall_filter_item() that contains the inner loop code of seccomp_add_syscall_filter_set(). This helper function we can then export and make use of elsewhere.	2017-09-11 18:00:07 +02:00
Topi Miettinen	78e864e5b3	seccomp: LockPersonality boolean (#6193) Add LockPersonality boolean to allow locking down personality(2) system call so that the execution domain can't be changed. This may be useful to improve security because odd emulations may be poorly tested and source of vulnerabilities, while system services shouldn't need any weird personalities.	2017-08-29 15:54:50 +02:00
Lennart Poettering	165a31c0db	core: add two new special ExecStart= character prefixes This patch adds two new special character prefixes to ExecStart= and friends, in addition to the existing "-", "@" and "+": "!" → much like "+", except with a much reduced effect as it only disables the actual setresuid()/setresgid()/setgroups() calls, but leaves all other security features on, including namespace options. This is very useful in combination with RuntimeDirectory= or DynamicUser= and similar options, as a user is still allocated and used for the runtime directory, but the actual UID/GID dropping is left to the daemon process itself. This should make RuntimeDirectory= a lot more useful for daemons which insist on doing their own privilege dropping. "!!" → Similar to "!", but on systems supporting ambient caps this becomes a NOP. This makes it relatively straightforward to write unit files that make use of ambient capabilities to let systemd drop all privs while retaining compatibility with systems that lack ambient caps, where priv dropping is left to the daemon code itself. This is an alternative approach to #6564 and related PRs.	2017-08-10 15:04:32 +02:00
Lennart Poettering	6eaaeee93a	seccomp: add new @setuid seccomp group This new group lists all UID/GID credential changing syscalls (which are quite a number these days). This will become particularly useful in a later commit, which uses this group to optionally permit user credential changing to daemons in case ambient capabilities are not available.	2017-08-10 15:02:50 +02:00
Yu Watanabe	b16bd5350f	seccomp-util: add parse_syscall_archs()	2017-08-07 23:41:52 +09:00
Zbigniew Jędrzejewski-Szmek	da1921a5c3	seccomp: enable RestrictAddressFamilies on ppc64, autodetect SECCOMP_RESTRICT_ADDRESS_FAMILIES_BROKEN We expect that if socket() syscall is available, seccomp works for that architecture. So instead of explicitly listing all architectures where we know it is not available, just assume it is broken if the number is not defined. This should have the same effect, except that other architectures where it is also broken will pass tests without further changes. (Architectures where the filter should work, but does not work because of missing entries in seccomp-util.c, will still fail.) i386, s390, s390x are the exception — setting the filter fails, even though socket() is available, so it needs to be special-cased (https://github.com/systemd/systemd/issues/5215#issuecomment-277241488). This removes the last define in seccomp-util.h that was only used in test-seccomp.c. Porting the seccomp filter to new architectures should be simpler because now only two places need to be modified. RestrictAddressFamilies seems to work on ppc64[bl]e, so enable it (the tests pass).	2017-05-10 09:21:16 -04:00
Zbigniew Jędrzejewski-Szmek	511ceb1f8d	seccomp: assume clone() arg order is known on all architectures While adding the defines for arm, I realized that we have pretty much all known architectures covered, so SECCOMP_RESTRICT_NAMESPACES_BROKEN is not necessary anymore. clone(2) is adamant that the order of the first two arguments is only reversed on s390/s390x. So let's simplify things and remove the #if.	2017-05-07 20:01:04 -04:00
Zbigniew Jędrzejewski-Szmek	2a65bd94e4	seccomp: drop SECCOMP_MEMORY_DENY_WRITE_EXECUTE_BROKEN, add test for shmat SECCOMP_MEMORY_DENY_WRITE_EXECUTE_BROKEN was conflating two separate things: 1. whether shmat/shmdt/shmget can be filtered (if ipc multiplexer is used, they can not) 2. whether we know this for the current architecture For i386, shmat is implemented as ipc, so seccomp filter is "broken" for shmat, but not for mmap, and SECCOMP_MEMORY_DENY_WRITE_EXECUTE_BROKEN cannot be used to cover both cases. The define was only used for tests — not in the implementation in seccomp-util.c. So let's get rid of SECCOMP_MEMORY_DENY_WRITE_EXECUTE_BROKEN and encode the right condition directly in tests.	2017-05-07 18:59:37 -04:00
James Cowgill	a3645cc6dd	seccomp: add clone syscall definitions for mips (#5880) Also updates the documentation and adds a mention of ppc64 support which was enabled by #5325. Tested on Debian mipsel and mips64el. The other 4 mips architectures should have an identical user <-> kernel ABI to one of the 2 tested systems.	2017-05-03 18:35:45 +02:00
Zbigniew Jędrzejewski-Szmek	290f0ff9aa	Define clone order on ppc (#5325) This was tested on ppc64le. Assume the same is true for ppc64.	2017-02-14 11:27:40 +01:00
Lennart Poettering	ae9d60ce4e	seccomp: on s390 the clone() parameters are reversed Add a bit of code that tries to get the right parameter order in place for some of the better known architectures, and skips restrict_namespaces for other archs. This also bypasses the test on archs where we don't know the right order. In this case I didn't bother with testing the case where no filter is applied, since that is hopefully just an issue for now, as there's nothing stopping us from supporting more archs, we just need to know which order is right. Fixes: #5241	2017-02-08 22:21:27 +01:00
Lennart Poettering	8a50cf6957	seccomp: MemoryDenyWriteExecute= should affect both mmap() and mmap2() (#5254) On i386 we block the old mmap() call entirely, since we cannot properly filter it. Thankfully it hasn't been used by glibc for quite some time. Fixes: #5240	2017-02-08 15:14:02 +01:00
Lennart Poettering	ad8f1479b4	seccomp: RestrictAddressFamilies= is not supported on i386/s390/s390x, make it a NOP See: #5215	2017-02-06 14:17:12 +01:00
Lennart Poettering	469830d142	seccomp: rework seccomp code, to improve compat with some archs This substantially reworks the seccomp code, to ensure better compatibility with some architectures, including i386. So far we relied on libseccomp's internal handling of the multiple syscall ABIs supported on Linux. This is problematic however, as it does not define clear semantics if an ABI is not able to support specific seccomp rules we install. This rework hence changes a couple of things: - We no longer use seccomp_rule_add(), but only seccomp_rule_add_exact(), and fail the installation of a filter if the architecture doesn't support it. - We no longer rely on adding multiple syscall architectures to a single filter, but instead install a separate filter for each syscall architecture supported. This way, we can install a strict filter for x86-64, while permitting a less strict filter for i386. - All high-level filter additions are now moved from execute.c to seccomp-util.c, so that we can test them independently of the service execution logic. - Tests have been added for all types of our seccomp filters. - SystemCallFilters= and SystemCallArchitectures= are now implemented in independent filters and installation logic, as they semantically are very much independent of each other. Fixes: #4575	2017-01-17 22:14:27 -05:00
Lennart Poettering	bd2ab3f4f6	seccomp: add two new filter sets: @reboot and @swap These group reboot()/kexec() and swapon()/swapoff() respectively	2016-12-27 18:09:37 +01:00
Lennart Poettering	1a1b13c957	seccomp: add @filesystem syscall group (#4537) @filesystem groups various file system operations, such as opening files and directories for read/write and stat()ing them, plus renaming, deleting, symlinking, hardlinking.	2016-11-21 19:29:12 -05:00
Lennart Poettering	add005357d	core: add new RestrictNamespaces= unit file setting This new setting permits restricting whether namespaces may be created and managed by processes started by a unit. It installs a seccomp filter blocking certain invocations of unshare(), clone() and setns(). RestrictNamespaces=no is the default, and does not restrict namespaces in any way. RestrictNamespaces=yes takes away the ability to create or manage any kind of namespace. "RestrictNamespaces=mnt ipc" restricts the creation of namespaces so that only mount and IPC namespaces may be created/managed, but no other kind of namespaces. This setting should improve security quite a bit as in particular user namespacing was a major source of CVEs in the kernel in the past, and is accessible to unprivileged processes. With this setting the entire attack surface may be removed for system services that do not make use of namespaces.	2016-11-04 07:40:13 -06:00
Zbigniew Jędrzejewski-Szmek	d5efc18b60	seccomp-util, analyze: export comments as a help string Just to make the whole thing easier for users.	2016-11-03 09:35:36 -04:00
Zbigniew Jędrzejewski-Szmek	40eb6a8014	seccomp-util: move @default to the first position Now that the list is user-visible, @default should be first.	2016-11-03 09:35:36 -04:00
Lennart Poettering	133ddbbeae	seccomp: add two new syscall groups @resources contains various syscalls that alter resource limits and memory and scheduling parameters of processes. As such they are good candidates to block for most services. @basic-io contains a number of basic syscalls for I/O, similar to the list seccomp v1 permitted but slightly more complete. It should be useful for building basic whitelisting for minimal sandboxes	2016-11-02 08:50:00 -06:00
Lennart Poettering	f6281133de	seccomp: add test-seccomp test tool This validates the system call set table and many of our seccomp-util.c APIs.	2016-10-24 17:32:51 +02:00
Lennart Poettering	a3be2849b2	seccomp: add new helper call seccomp_load_filter_set() This allows us to unify most of the code in apply_protect_kernel_modules() and apply_private_devices().	2016-10-24 17:32:50 +02:00


59 commits