Systemd

Commit Graph

Author	SHA1	Message	Date
Lennart Poettering	d8b4d14df4	util: split out nulstr related stuff to nulstr-util.[ch]	2019-03-14 13:25:52 +01:00
Topi Miettinen	aecd5ac621	core: ProtectHostname= feature Let services use a private UTS namespace. In addition, a seccomp filter is installed on set{host,domain}name and a ro bind mounts on /proc/sys/kernel/{host,domain}name.	2019-02-20 10:50:44 +02:00
Lennart Poettering	57c03b1e6e	seccomp: drop mincore() from @system-service syscall filter group Previously, this system call was included in @system-service since it is a "getter" only, i.e. only queries information, and doesn't change anything, and hence was considered not risky. However, as it turns out, mincore() is actually security sensitive, see the discussion here: https://lwn.net/Articles/776034/ Hence, let's adjust the system call filter and drop mincore() from it. This constitutes a compatibility break to some level, however I presume we can get away with this as the system call is pretty exotic. The fact that it is pretty exotic is also reflected by the fact that the kernel intends to majorly change behaviour of the system call soon (see the linked LWN article).	2019-01-16 18:08:35 +01:00
Lennart Poettering	ad5ffe3716	seccomp-util: drop process_vm_readv from @debug group it's already part of @ipc, no need to have it in both. Given that @ipc is much more popular (as it is part of @system-service for example), let's not define it a second time.	2018-11-30 16:46:09 +01:00
Zbigniew Jędrzejewski-Szmek	baaa35ad70	coccinelle: make use of SYNTHETIC_ERRNO Ideally, coccinelle would strip unnecessary braces too. But I do not see any option in coccinelle for this, so instead, I edited the patch text using search&replace to remove the braces. Unfortunately this is not fully automatic, in particular it didn't deal well with if-else-if-else blocks and ifdefs, so there is an increased likelihood of some bugs in such spots. I also removed part of the patch that coccinelle generated for udev, where we return -1 for failure. This should be fixed independently.	2018-11-22 10:54:38 +01:00
Lennart Poettering	a05cfe230f	seccomp: add some missing syscalls to filter sets	2018-11-16 16:10:57 +01:00
Zbigniew Jędrzejewski-Szmek	a90db619ca	shared: fix typo	2018-11-10 07:43:57 +01:00
Yu Watanabe	14cb109d45	tree-wide: replace 'unsigned int' with 'unsigned'	2018-10-19 22:19:12 +02:00
Zbigniew Jędrzejewski-Szmek	7e86bd73a4	seccomp: tighten checking of seccomp filter creation In seccomp code, the code is changed to propagate errors which are about anything other than unknown/unimplemented syscalls. I think such errors should not happen in normal usage, but so far we would summarily ignore all errors, so that part is uncertain. If it turns out that other errors occur and should be ignored, this should be added later. In nspawn, we would count the number of added filters, but didn't use this for anything. Drop that part. The comments suggested that seccomp_add_syscall_filter_item() returned negative if the syscall is unknown, but this wasn't true: it returns 0. The error at this point can only be if the syscall was known but couldn't be added. If the error comes from our internal whitelist in nspawn, treat this as error, because it means that our internal table is wrong. If the error comes from user arguments, warn and ignore. (If some syscall is not known at current architecture, it is still silently ignored.)	2018-09-24 17:21:09 +02:00
Zbigniew Jędrzejewski-Szmek	b54f36c604	seccomp: reduce logging about failure to add syscall to seccomp Our logs are full of: Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call oldstat() / -10037, ignoring: Numerical argument out of domain Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call get_thread_area() / -10076, ignoring: Numerical argument out of domain Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call set_thread_area() / -10079, ignoring: Numerical argument out of domain Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call oldfstat() / -10034, ignoring: Numerical argument out of domain Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call oldolduname() / -10036, ignoring: Numerical argument out of domain Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call oldlstat() / -10035, ignoring: Numerical argument out of domain Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call waitpid() / -10073, ignoring: Numerical argument out of domain ... This is pointless and makes debug logs hard to read. Let's keep the logs in test code, but disable it in nspawn and pid1. This is done through a function parameter because those functions operate recursively and it's not possible to make the caller to log meaningfully. There should be no functional change, except the skipped debug logs.	2018-09-24 17:21:09 +02:00
Lucas Werkmeister	9d7fe7c65a	seccomp: permit specifying multiple errnos for a syscall If more than one errno is specified for a syscall in SystemCallFilter=, use the last one instead of reporting an error. This is especially useful when used with system call sets: SystemCallFilter=@privileged:EPERM @reboot This will block any system call requiring super-user capabilities with EPERM, except for attempts to reboot the system, which will immediately terminate the process. (@reboot is included in @privileged.) This also effectively fixes #9939, since specifying different errnos for “the same syscall” (same pseudo syscall number) is no longer an error.	2018-09-07 21:44:13 +02:00
Lucas Werkmeister	851ee70a3d	seccomp: improve error reporting Only report OOM if that was actually the error of the operation, explicitly report the possible error that a syscall was already blocked with a different errno and translate that into a more sensible errno (EEXIST only makes sense in connection to the hashmap), and pass through all other potential errors unmodified. Part of #9939.	2018-08-29 21:42:03 +02:00
Lion Yang	a9518dc369	seccomp: add swapcontext into @process for ppc32 Some modern programming languages use userspace context switches to implement coroutine features. PowerPC (32-bit) needs the syscall "swapcontext" to get contexts or switch between contexts, which is special. Adding this rule should fix #9485.	2018-07-03 13:35:02 +02:00
Lennart Poettering	e05ee49b14	seccomp: explain why we use setuid rather than @setuid in @privileged	2018-06-14 17:44:20 +02:00
Lennart Poettering	705268414f	seccomp: add new system call filter, suitable as default whitelist for system services Currently we employ mostly system call blacklisting for our system services. Let's add a new system call filter group @system-service that helps turning this around into a whitelist by default. The new group is very similar to nspawn's default filter list, but in some ways more restricted (as sethostname() and suchlike shouldn't be available to most system services just like that) and in others more relaxed (for example @keyring is blocked in nspawn since it's not properly virtualized yet in the kernel, but is fine for regular system services).	2018-06-14 17:44:20 +02:00
Lennart Poettering	0c69794138	tree-wide: remove Lennart's copyright lines These lines are generally out-of-date, incomplete and unnecessary. With SPDX and git repository much more accurate and fine grained information about licensing and authorship is available, hence let's drop the per-file copyright notice. Of course, removing copyright lines of others is problematic, hence this commit only removes my own lines and leaves all others untouched. It might be nicer if sooner or later those could go away too, making git the only and accurate source of authorship information.	2018-06-14 10:20:20 +02:00
Lennart Poettering	818bf54632	tree-wide: drop 'This file is part of systemd' blurb This part of the copyright blurb stems from the GPL use recommendations: https://www.gnu.org/licenses/gpl-howto.en.html The concept appears to originate in times where version control was per file, instead of per tree, and was a way to glue the files together. Ultimately, we nowadays don't live in that world anymore, and this information is entirely useless anyway, as people are very welcome to copy these files into any projects they like, and they shouldn't have to change bits that are part of our copyright header for that. hence, let's just get rid of this old cruft, and shorten our codebase a bit.	2018-06-14 10:20:20 +02:00
Yu Watanabe	86c2a9f1c2	nsflags: drop namespace_flag_{from,to}_string() This also drops namespace_flag_to_string_many_with_check(), and renames namespace_flag_{from,to}_string_many() to namespace_flags_{from,to}_string().	2018-05-05 11:07:37 +09:00
Zbigniew Jędrzejewski-Szmek	11a1589223	tree-wide: drop license boilerplate Files which are installed as-is (any .service and other unit files, .conf files, .policy files, etc), are left as is. My assumption is that SPDX identifiers are not yet that well known, so it's better to retain the extended header to avoid any doubt. I also kept any copyright lines. We can probably remove them, but it'd be nice to obtain explicit acks from all involved authors before doing that.	2018-04-06 18:58:55 +02:00
Yu Watanabe	1cc6c93a95	tree-wide: use TAKE_PTR() and TAKE_FD() macros	2018-04-05 14:26:26 +09:00
James Cowgill	303d6b4ca6	Partially revert "seccomp: add mmap and address family restrictions for MIPS" (#8563) This reverts the mmap parts of `f5aeac1439`, but keeps the part which restricts address families which works correctly. Unfortunately the MIPS toolchains still do not implement PT_GNU_STACK. This means that while the commit to restrict mmap on MIPS was "correct", it had the side effect of causing pthread_create to fail because glibc tries to allocate an executable stack for new threads in the absence of PT_GNU_STACK. We should wait until PT_GNU_STACK is implemented in all the relevant parts of the toolchain (at least gcc and glibc) before enabling this again.	2018-03-23 16:04:16 +01:00
James Cowgill	f5aeac1439	seccomp: add mmap and address family restrictions for MIPS (#8547)	2018-03-22 15:40:44 +01:00
Mathieu Malaterre	0d9fca76bb	seccomp: enable RestrictAddressFamilies on ppc (#8505) In commit `da1921a5c3` ppc64/ppc64el were added as supported architectures for socketcall() for the POWER family. Extend the support to the 32-bit architectures.	2018-03-20 16:08:20 +01:00
Lennart Poettering	13d92c6300	seccomp: rework functions for parsing system call filters This reworks system call filter parsing, and replaces a couple of "bool" function arguments by a single flags parameter. This shouldn't change behaviour, except for one case: when we recursively call our parsing function on our own syscall list, then we'll lower the log level to LOG_DEBUG from LOG_WARNING, because at that point things are just a problem in our own code rather than in the user configuration we are parsing, and we shouldn't hence generate confusing warnings about syntax errors. Fixes: #8261	2018-02-27 19:59:09 +01:00
Alan Jenkins	2428aaf8a2	seccomp: allow x86-64 syscalls on x32, used by the VDSO (fix #8060) The VDSO provided by the kernel for x32, uses x86-64 syscalls instead of x32 ones. I think we can safely allow this; the set of x86-64 syscalls should be very similar to the x32 ones. The real point is not to allow x86 syscalls, because some of those are inconveniently multiplexed and we're apparently not able to block the specific actions we want to.	2018-02-02 18:12:34 +00:00
Alan Jenkins	5c19ff79de	seccomp-util: fix alarming debug message (#8002, #8001) Booting with `systemd.log_level=debug` and looking in `dmesg -u` showed messages like this: systemd[433]: Failed to add rule for system call n/a() / 156, ignoring: Numerical argument out of domain This commit fixes it to: systemd[449]: Failed to add rule for system call _sysctl() / 156, ignoring: Numerical argument out of domain Some of the messages could be even more misleading, e.g. we were reporting that utimensat() / 320 was skipped as non-existent on x86, when actually the syscall number 320 is kexec_file_load() on x86. The problem was that syscall NRs are looked up (and correctly passed to libseccomp) as native syscall NRs. But we forgot that when we tried to go back from the syscall NR to the name. I think the natural way to write this would be seccomp_syscall_resolve_num(nr), however there is no such function. I couldn't work out a short comment that would make this clearer. FWIW I wrote it up as a ticket for libseccomp instead. https://github.com/seccomp/libseccomp/issues/104	2018-01-31 17:20:14 +00:00
Lennart Poettering	7785da68e6	Merge pull request #7695 from yuwata/transient-socket DBus-API: implement transient socket unit	2017-12-23 19:20:29 +01:00
Yu Watanabe	898748d8b9	core,seccomp: fix logic to parse syscall filter in dbus-execute.c If multiple SystemCallFilter= settings were sent over the bus, some of them whitelists and others blacklists, the parse result was corrupted. This fixes the parse logic, which is now the same as the one used in load-fragment.c	2017-12-23 18:45:32 +09:00
Mathieu Malaterre	63d00dfb64	shared/seccomp: add mmap handling for powerpc Also remove the warning: ./src/shared/seccomp-util.c:1414:2: warning: #warning "Consider adding the right mmap() syscall definitions here!" [-Wcpp] #warning "Consider adding the right mmap() syscall definitions here!"	2017-12-22 15:30:03 +01:00
Lennart Poettering	f1d34068ef	tree-wide: add DEBUG_LOGGING macro that checks whether debug logging is on (#7645) This makes things a bit easier to read I think, and also makes sure we always use the _unlikely_ wrapper around it, which so far we used sometimes and other times we didn't. Let's clean that up.	2017-12-15 11:09:00 +01:00
Zbigniew Jędrzejewski-Szmek	53e1b68390	Add SPDX license identifiers to source files under the LGPL This follows what the kernel is doing, c.f. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5fd54ace4721fc5ce2bb5aef6318fcf17f421460.	2017-11-19 19:08:15 +01:00
Zbigniew Jędrzejewski-Szmek	91691f1d3e	shared/seccomp: skip pkey_mprotect protections if the syscall is unknown When compiling with an old kernel on architectures for which the number is not defined in missing.h, a warning is generated in missing.h. Let's just skip the protection in this case, to allow build to proceed.	2017-11-13 09:35:49 +01:00
Zbigniew Jędrzejewski-Szmek	b835eeb4ec	shared/seccomp: disallow pkey_mprotect the same as mprotect for W^X mappings (#7295) MemoryDenyWriteExecution policy could be bypassed by using pkey_mprotect instead of mprotect to create an executable writable mapping. The impact is mitigated by the fact that the man page says "Note that this feature is fully available on x86-64, and partially on x86", so hopefully people do not rely on it as a sole security measure. Found by Karin Hossen and Thomas Imbert from Sogeti ESEC R&D. https://bugs.launchpad.net/bugs/1725348	2017-11-12 17:28:48 +01:00
Lennart Poettering	ce5faeac1f	seccomp: include ARM set_tls in @default (#7297) Fixes: #7135	2017-11-12 16:34:43 +01:00
Yu Watanabe	8cfa775f4f	core: add support to specify errno in SystemCallFilter= This lets each system call in a SystemCallFilter= blacklist optionally take an errno name or number after a colon. The errno takes precedence over the one given by SystemCallErrorNumber=. C.f. #7173. Closes #7169.	2017-11-11 21:54:12 +09:00
Antonio Rojas	8e6a7a8b2b	Fix typo in statx macro (#7180) This makes statx properly whitelisted on supported systems.	2017-11-10 11:07:36 +01:00
Lennart Poettering	af0f047ba8	seccomp: port @privileged to use @reboot + @swap Let's reuse two groups we already defined to make @privileged a bit shorter.	2017-10-05 15:42:48 +02:00
Lennart Poettering	e59608fa5f	seccomp: there is no "kexec" syscall it's called "kexec_load".	2017-10-05 15:42:48 +02:00
Lennart Poettering	44898c5358	seccomp: add three more seccomp groups @aio → asynchronous IO calls @sync → msync/fsync/... and friends @chown → changing file ownership (Also, change @privileged to reference @chown now, instead of the individual syscalls it contains)	2017-10-05 15:42:48 +02:00
Djalal Harouni	7c72bab4e3	seccomp: remove 'gettid' syscall from '@process' syscall set (#6989) The gettid syscall is one of the most basic syscalls, it never fails and it operates on the current thread. Most applications are not supposed to use it, however even if it is used there is not much justification for blocking it. This patch removes it from the '@process' set so if users blacklist this set to block setns or clone syscalls, the gettid syscall will still be available. Of course they can always block gettid explicitly. Note that gettid is already in the '@default' set.	2017-10-05 14:46:41 +02:00
Lennart Poettering	448ac526a3	seccomp: ignore (and debug log) errors by all invocations of seccomp_rule_add_exact() System calls might exist on some archs but not on others, or might be multiplexed but not on others. Ignore such errors when putting together a filter at this location like we already do it on all others.	2017-10-05 11:27:34 +02:00
Lennart Poettering	1c6af69b2d	seccomp: always handle seccomp_load() failing the same way Unfortunately libseccomp doesn't return (nor document) clean error codes, hence until then only check for specific error codes that we propagate, but ignore (but debug log) all others. Do this at one more place, we are already doing that at all others.	2017-10-05 11:27:34 +02:00
Lennart Poettering	ff217dc3af	seccomp: react gracefully if we can't translate a syscall name When a libseccomp implementation doesn't know a syscall yet, that's no reason for us to fail completely. Instead, debug log, and proceed. This hopefully fixes the preadv2/pwritev2 issues pointed out here: https://github.com/systemd/systemd/pull/6952#issuecomment-334302923	2017-10-05 11:27:34 +02:00
Lennart Poettering	4c3a917617	seccomp: include prlimit64 and ugetrlimit in @default Also, move prlimit64() out of @resources. prlimit64() may be used both for getting and setting resource limits, and is implicitly called by glibc at various places, on some archs, the same way as getrlimit(). Similarly, ugetrlimit() is an arch-specific replacement for getrlimit(), and hence should be whitelisted at the same place as getrlimit() and prlimit64(). Also see: https://lists.freedesktop.org/archives/systemd-devel/2017-September/039543.html	2017-10-05 11:27:34 +02:00
Djalal Harouni	8f44de08e9	seccomp: add sched_yield syscall to the @default syscall set	2017-10-04 10:41:42 +01:00
Djalal Harouni	09d3020b0a	seccomp: remove '@credentials' syscall set (#6958) This removes the '@credentials' syscall set that was added in commit v234-468-gcd0ddf6f75. Most of these syscalls are so simple that we do not want to filter them. They work on the current calling process, doing only read operations, they do not have a deep kernel path. The problem may only be in the 'capget' syscall since it can query arbitrary processes, and be used to discover processes, however sending signal 0 to arbitrary processes can be used to discover if a process exists or not. It is unfortunate that Linux allows querying processes of different users. Let's put it now in the '@process' syscall set, and later we may add it to a new '@basic-process' set that allows most basic process operations.	2017-10-03 07:20:05 +02:00
Lennart Poettering	cff7bff880	seccomp: improve debug logging Let's log explicitly at debug level if we encounter a syscall or group that doesn't exist at all.	2017-09-14 15:45:21 +02:00
Lennart Poettering	cd0ddf6f75	seccomp: add four new syscall groups These groups should be useful shortcuts for sets of closely related syscalls where it usually makes more sense to allow them altogether or not at all.	2017-09-14 15:45:21 +02:00
Lennart Poettering	0963c053fa	seccomp: augment the @resources group a bit Given that sched_setattr/sched_setparam/sched_setscheduler are already in the group the closely related nice + ioprio_set should also be included. Also, order things alphabetically.	2017-09-14 15:45:21 +02:00
Lennart Poettering	b887d2ebfe	seccomp: beef up @process group a bit Include the wait syscalls. If we permit forking then we should also permit waiting for a process. Similar to that: also permit determining the usage counters for processes. Include calls to determine process/thread identity. They have little impact security-wise, but are very likely used when process management of any form is done. Also, add rt_sigqueueinfo + rt_tgsigqueueinfo as they are similar to kill() and friends, but permit passing along a userdata ptr.	2017-09-14 15:45:21 +02:00
Lennart Poettering	7e0c3b8fda	seccomp: "idle" is another obsolete syscall	2017-09-14 15:45:21 +02:00
Lennart Poettering	215728ff39	seccomp: order the syscalls in more groups alphabetically No changes besides reordering.	2017-09-14 15:45:21 +02:00
Lennart Poettering	ceaa6aa76b	seccomp: let's update @file-system a bit Let's add fremovexattr which was the only xattr syscall so far missing from the group, even though lremovexattr and friends were included. Add inotify_init, which is an older (but still supported) version of inotify_init1. Add oldfstat, oldlstat, oldstat which are old versions of the stat syscalls on some archs. Add utime, which is an older more limited version of utimes and utimensat. Enclose the "statx" entry in some ifdeffery to ensure libseccomp actually knows the syscall. If libseccomp doesn't know it, then we'd get EINVAL rather than EDOM (which is what is returned if a syscall is known but not available on the local system) when resolving the syscall name and we really don't want that, as we use the EDOM vs. EINVAL check for determining whether a syscall makes sense at all. Also, order things alphabetically.	2017-09-14 15:45:21 +02:00
Lennart Poettering	648a0ed0d7	seccomp: let's update base-io a bit Let's add _llseek which is the syscall name on some archs that on others is simply lseek (due to 64bit vs 32bit off_t confusion). Also, let's sort things alphabetically.	2017-09-14 15:45:21 +02:00
Lennart Poettering	e41b0f42a8	seccomp: update "@default" seccomp group a bit Let's add more of the most basic operations to "@default" as absolute baseline needed by glibc and such to operate. Specifically: futex, get_robust_list, get_thread_area, membarrier, set_robust_list, set_thread_area, set_tid_address are all required to properly implement mutexes and other thread synchronization logic. Given that a ton of datastructures are protected by mutexes (such as stdio and such), let's just whitelist this by default, so that things can just work. restart_syscall is used to implement EAGAIN SA_RESTART stuff in some archs, and synthesized by the kernel without any explicit user logic, hence let's make this work out of the box.	2017-09-14 15:45:21 +02:00
Lennart Poettering	960e4569e1	nspawn: implement configurable syscall whitelisting/blacklisting Now that we have ported nspawn's seccomp code to the generic code in seccomp-util, let's extend it to support whitelisting and blacklisting of specific additional syscalls. This uses similar syntax as PID1's support for system call filtering, but in contrast to that always implements a blacklist (and not a whitelist), as we prepopulate the filter with a blacklist, and the unit's system call filter logic does not come with anything prepopulated. (Later on we might actually want to invert the logic here, and whitelist rather than blacklist things, but at this point let's not do that. In case we switch this over later, the syscall add/remove logic of this commit should be compatible conceptually.) Fixes: #5163 Replaces: #5944	2017-09-12 14:06:21 +02:00
Lennart Poettering	69b1b241bb	seccomp: split out inner loop code of seccomp_add_syscall_filter_set() Let's add a new helper function seccomp_add_syscall_filter_item() that contains the inner loop code of seccomp_add_syscall_filter_set(). This helper function we can then export and make use of elsewhere.	2017-09-11 18:00:07 +02:00
Lennart Poettering	12dc378902	seccomp: drop default_action parameter from seccomp_add_syscall_filter_set() The function doesn't actually use the parameter, hence let's drop it.	2017-09-11 18:00:07 +02:00
Lennart Poettering	a4135a742d	shared: add statx(2) to @file-system syscall filter list (#6738)	2017-09-04 15:35:35 +02:00
Lennart Poettering	72eafe7159	seccomp: rework seccomp_lock_personality() to apply filter to all archs	2017-08-29 15:58:13 +02:00
Topi Miettinen	78e864e5b3	seccomp: LockPersonality boolean (#6193) Add LockPersonality boolean to allow locking down personality(2) system call so that the execution domain can't be changed. This may be useful to improve security because odd emulations may be poorly tested and a source of vulnerabilities, while system services shouldn't need any weird personalities.	2017-08-29 15:54:50 +02:00
Lennart Poettering	165a31c0db	core: add two new special ExecStart= character prefixes This patch adds two new special character prefixes to ExecStart= and friends, in addition to the existing "-", "@" and "+": "!" → much like "+", except with a much reduced effect as it only disables the actual setresuid()/setresgid()/setgroups() calls, but leaves all other security features on, including namespace options. This is very useful in combination with RuntimeDirectory= or DynamicUser= and similar options, as a user is still allocated and used for the runtime directory, but the actual UID/GID dropping is left to the daemon process itself. This should make RuntimeDirectory= a lot more useful for daemons which insist on doing their own privilege dropping. "!!" → Similar to "!", but on systems supporting ambient caps this becomes a NOP. This makes it relatively straightforward to write unit files that make use of ambient capabilities to let systemd drop all privs while retaining compatibility with systems that lack ambient caps, where priv dropping is left to the daemon code itself. This is an alternative approach to #6564 and related PRs.	2017-08-10 15:04:32 +02:00
Lennart Poettering	6eaaeee93a	seccomp: add new @setuid seccomp group This new group lists all UID/GID credential changing syscalls (which are quite a number these days). This will become particularly useful in a later commit, which uses this group to optionally permit user credential changing to daemons in case ambient capabilities are not available.	2017-08-10 15:02:50 +02:00
Yu Watanabe	b16bd5350f	seccomp-util: add parse_syscall_archs()	2017-08-07 23:41:52 +09:00
Zbigniew Jędrzejewski-Szmek	79873bc850	seccomp: arm64 does not have mmap2 I messed up when adding the definitions in `4278d1f531`. Unfortunately I didn't have the hardware at hand and went by looking at the kernel headers. (cherry picked from commit 53196fafcb7b24b45ed4f48ab894d00a24a6d871)	2017-07-15 17:18:22 -04:00
Zbigniew Jędrzejewski-Szmek	2e64e8f46d	seccomp: arm64/x32 do not have _sysctl So don't even try to add the filter to reduce noise. The test is updated to skip calling _sysctl because the kernel prints an oops-like message that is confusing and unhelpful: Jul 15 21:07:01 rpi3 kernel: test-seccomp[8448]: syscall -10080 Jul 15 21:07:01 rpi3 kernel: Code: aa0503e4 aa0603e5 aa0703e6 d4000001 (b13ffc1f) Jul 15 21:07:01 rpi3 kernel: CPU: 3 PID: 8448 Comm: test-seccomp Tainted: G W 4.11.8-300.fc26.aarch64 #1 Jul 15 21:07:01 rpi3 kernel: Hardware name: raspberrypi rpi/rpi, BIOS 2017.05 06/24/2017 Jul 15 21:07:01 rpi3 kernel: task: ffff80002bb0bb00 task.stack: ffff800036354000 Jul 15 21:07:01 rpi3 kernel: PC is at 0xffff8669c7c4 Jul 15 21:07:01 rpi3 kernel: LR is at 0xaaaac64b6750 Jul 15 21:07:01 rpi3 kernel: pc : [<0000ffff8669c7c4>] lr : [<0000aaaac64b6750>] pstate: 60000000 Jul 15 21:07:01 rpi3 kernel: sp : 0000ffffdc640fd0 Jul 15 21:07:01 rpi3 kernel: x29: 0000ffffdc640fd0 x28: 0000000000000000 Jul 15 21:07:01 rpi3 kernel: x27: 0000000000000000 x26: 0000000000000000 Jul 15 21:07:01 rpi3 kernel: x25: 0000000000000000 x24: 0000000000000000 Jul 15 21:07:01 rpi3 kernel: x23: 0000000000000000 x22: 0000000000000000 Jul 15 21:07:01 rpi3 kernel: x21: 0000aaaac64b4940 x20: 0000000000000000 Jul 15 21:07:01 rpi3 kernel: x19: 0000aaaac64b88f8 x18: 0000000000000020 Jul 15 21:07:01 rpi3 kernel: x17: 0000ffff8669c7a0 x16: 0000aaaac64d2ee0 Jul 15 21:07:01 rpi3 kernel: x15: 0000000000000000 x14: 0000000000000000 Jul 15 21:07:01 rpi3 kernel: x13: 203a657275746365 x12: 0000000000000000 Jul 15 21:07:01 rpi3 kernel: x11: 0000ffffdc640418 x10: 0000000000000000 Jul 15 21:07:01 rpi3 kernel: x9 : 0000000000000005 x8 : 00000000ffffd8a0 Jul 15 21:07:01 rpi3 kernel: x7 : 7f7f7f7f7f7f7f7f x6 : 7f7f7f7f7f7f7f7f Jul 15 21:07:01 rpi3 kernel: x5 : 65736d68716f7277 x4 : 0000000000000000 Jul 15 21:07:01 rpi3 kernel: x3 : 0000000000000008 x2 : 0000000000000000 Jul 15 21:07:01 rpi3 kernel: x1 : 0000000000000000 x0 : 0000000000000000 Jul 15 21:07:01 rpi3 kernel: (cherry picked from commit 1e20e640132c700c23494bb9e2619afb83878380)	2017-07-15 17:18:22 -04:00
Zbigniew Jędrzejewski-Szmek	e7854c46be	shared/seccomp-util: add parentheses and no. after syscall name "Failed to add rule for system call access, ignoring: Numerical argument out of domain" is confusing. Make that "... system call access() / 238". (cherry picked from commit 977dc6ca5acb8069a2966ec63e7378576bc2ca51)	2017-07-15 17:18:22 -04:00
Zbigniew Jędrzejewski-Szmek	da1921a5c3	seccomp: enable RestrictAddressFamilies on ppc64, autodetect SECCOMP_RESTRICT_ADDRESS_FAMILIES_BROKEN We expect that if socket() syscall is available, seccomp works for that architecture. So instead of explicitly listing all architectures where we know it is not available, just assume it is broken if the number is not defined. This should have the same effect, except that other architectures where it is also broken will pass tests without further changes. (Architectures where the filter should work, but does not work because of missing entries in seccomp-util.c, will still fail.) i386, s390, s390x are the exception: setting the filter fails, even though socket() is available, so it needs to be special-cased (https://github.com/systemd/systemd/issues/5215#issuecomment-277241488). This removes the last define in seccomp-util.h that was only used in test-seccomp.c. Porting the seccomp filter to new architectures should be simpler because now only two places need to be modified. RestrictAddressFamilies seems to work on ppc64[bl]e, so enable it (the tests pass).	2017-05-10 09:21:16 -04:00
Zbigniew Jędrzejewski-Szmek	511ceb1f8d	seccomp: assume clone() arg order is known on all architectures While adding the defines for arm, I realized that we have pretty much all known architectures covered, so SECCOMP_RESTRICT_NAMESPACES_BROKEN is not necessary anymore. clone(2) is adamant that the order of the first two arguments is only reversed on s390/s390x. So let's simplify things and remove the #if.	2017-05-07 20:01:04 -04:00
Zbigniew Jędrzejewski-Szmek	4278d1f531	seccomp: add mmap/shmat defines for arm and arm64	2017-05-07 20:01:04 -04:00
Zbigniew Jędrzejewski-Szmek	2a8d6e6395	seccomp: add mmap/shmat defines for ppc64	2017-05-07 20:01:04 -04:00
Zbigniew Jędrzejewski-Szmek	6dc666886a	seccomp: factor out seccomp_rule_add_exact to a helper function	2017-05-07 19:01:11 -04:00
James Cowgill	a3645cc6dd	seccomp: add clone syscall definitions for mips (#5880 ) Also updates the documentation and adds a mention of ppc64 support which was enabled by #5325. Tested on Debian mipsel and mips64el. The other 4 mips architectures should have an identical user <-> kernel ABI to one of the 2 tested systems.	2017-05-03 18:35:45 +02:00
Zbigniew Jędrzejewski-Szmek	290f0ff9aa	Define clone order on ppc (#5325 ) This was tested on ppc64le. Assume the same is true for ppc64.	2017-02-14 11:27:40 +01:00
Lennart Poettering	9606bc4b4b	seccomp: disable RestrictAddressFamilies= for the ABI we shall block, not the one we are compiled for (#5272 ) It's a difference. Not a big one, but let's be correct here.	2017-02-12 15:25:40 -05:00
Lennart Poettering	f2d9751c59	seccomp: order seccomp ABI list, so that our native ABI comes last (#5306) This way, we can still call seccomp ourselves, even if seccomp() is blocked by the filter we are installing. Fixes: #5300	2017-02-10 23:47:50 +01:00
Lennart Poettering	7961116e2c	seccomp: add forgotten munmap() syscall to @file-system (#5291) We added mmap() and mmap2(), but forgot munmap(). Fix that. Pointed out by @lucaswerkmeister: https://github.com/systemd/systemd/pull/4537#issuecomment-273275298	2017-02-09 21:29:33 -05:00
Lennart Poettering	ae9d60ce4e	seccomp: on s390 the clone() parameters are reversed Add a bit of code that tries to get the right parameter order in place for some of the better known architectures, and skips restrict_namespaces for other archs. This also bypasses the test on archs where we don't know the right order. In this case I didn't bother with testing the case where no filter is applied, since that is hopefully just an issue for now, as there's nothing stopping us from supporting more archs, we just need to know which order is right. Fixes: #5241	2017-02-08 22:21:27 +01:00
Lennart Poettering	8a50cf6957	seccomp: MemoryDenyWriteExecute= should affect both mmap() and mmap2() (#5254) On i386 we block the old mmap() call entirely, since we cannot properly filter it. Thankfully it hasn't been used by glibc for quite some time. Fixes: #5240	2017-02-08 15:14:02 +01:00
Lennart Poettering	ad8f1479b4	seccomp: RestrictAddressFamilies= is not supported on i386/s390/s390x, make it a NOP See: #5215	2017-02-06 14:17:12 +01:00
Evgeny Vereshchagin	1b52793d5d	seccomp: don't ever try to add an ABI before removing the default native ABI (#5230) https://github.com/systemd/systemd/issues/5215#issuecomment-277156262 libseccomp does not allow you to add architectures to a filter that doesn't match the byte ordering of the architectures already added to the filter (it would be a mess, not to mention largely pointless) and since systemd attempts to add an ABI before removing the default native ABI, you will always fail on Power (either due to ppc or ppc64le). The fix is to remove the native ABI before adding a new ABI so you don't run into problems with byte ordering. You would likely see the same failure on a MIPS system. Thanks @pcmoore!	2017-02-05 11:58:19 -05:00
Lennart Poettering	4d5bd50ab2	seccomp: minor simplifications for is_seccomp_available()	2017-01-17 22:14:27 -05:00
Lennart Poettering	469830d142	seccomp: rework seccomp code, to improve compat with some archs This substantially reworks the seccomp code, to ensure better compatibility with some architectures, including i386. So far we relied on libseccomp's internal handling of the multiple syscall ABIs supported on Linux. This is problematic however, as it does not define clear semantics if an ABI is not able to support specific seccomp rules we install. This rework hence changes a couple of things: - We no longer use seccomp_rule_add(), but only seccomp_rule_add_exact(), and fail the installation of a filter if the architecture doesn't support it. - We no longer rely on adding multiple syscall architectures to a single filter, but instead install a separate filter for each syscall architecture supported. This way, we can install a strict filter for x86-64, while permitting a less strict filter for i386. - All high-level filter additions are now moved from execute.c to seccomp-util.c, so that we can test them independently of the service execution logic. - Tests have been added for all types of our seccomp filters. - SystemCallFilters= and SystemCallArchitectures= are now implemented in independent filters and installation logic, as they semantically are very much independent of each other. Fixes: #4575	2017-01-17 22:14:27 -05:00
Lennart Poettering	802fa07a4a	seccomp: move bdflush() system call to @obsolete filter group The system call is obsolete after all.	2016-12-27 18:09:37 +01:00
Lennart Poettering	58a8f68be0	seccomp: add proper help string for @resources seccomp filter set	2016-12-27 18:09:37 +01:00
Lennart Poettering	bd2ab3f4f6	seccomp: add two new filter sets: @reboot and @swap These group reboot()/kexec() and swapon()/swapoff() respectively	2016-12-27 18:09:37 +01:00
Lennart Poettering	1a1b13c957	seccomp: add @filesystem syscall group (#4537) @filesystem groups various file system operations, such as opening files and directories for read/write and stat()ing them, plus renaming, deleting, symlinking, hardlinking.	2016-11-21 19:29:12 -05:00
Lennart Poettering	add005357d	core: add new RestrictNamespaces= unit file setting This new setting permits restricting whether namespaces may be created and managed by processes started by a unit. It installs a seccomp filter blocking certain invocations of unshare(), clone() and setns(). RestrictNamespaces=no is the default, and does not restrict namespaces in any way. RestrictNamespaces=yes takes away the ability to create or manage any kind of namespace. "RestrictNamespaces=mnt ipc" restricts the creation of namespaces so that only mount and IPC namespaces may be created/managed, but no other kind of namespaces. This setting should improve security quite a bit as in particular user namespacing was a major source of CVEs in the kernel in the past, and is accessible to unprivileged processes. With this setting the entire attack surface may be removed for system services that do not make use of namespaces.	2016-11-04 07:40:13 -06:00
Zbigniew Jędrzejewski-Szmek	d5efc18b60	seccomp-util, analyze: export comments as a help string Just to make the whole thing easier for users.	2016-11-03 09:35:36 -04:00
Zbigniew Jędrzejewski-Szmek	40eb6a8014	seccomp-util: move @default to the first position Now that the list is user-visible, @default should be first.	2016-11-03 09:35:36 -04:00
Lennart Poettering	133ddbbeae	seccomp: add two new syscall groups @resources contains various syscalls that alter resource limits and memory and scheduling parameters of processes. As such they are good candidates to block for most services. @basic-io contains a number of basic syscalls for I/O, similar to the list seccomp v1 permitted but slightly more complete. It should be useful for building basic whitelisting for minimal sandboxes	2016-11-02 08:50:00 -06:00
Lennart Poettering	cd5bfd7e60	seccomp: include pipes and memfd in @ipc These system calls clearly fall in the @ipc category, hence should be listed there, simply to avoid confusion and surprise by the user.	2016-11-02 08:50:00 -06:00
Lennart Poettering	a8c157ff30	seccomp: drop execve() from @process list The system call is already part of @default hence implicitly allowed anyway. Also, if it is actually blocked then systemd couldn't execute the service in question anymore, since the application of seccomp is immediately followed by it.	2016-11-02 08:49:59 -06:00
Lennart Poettering	c79aff9a82	seccomp: add clock query and sleeping syscalls to "@default" group Timing and sleep are such basic operations, it makes very little sense to ever block them, hence don't.	2016-11-02 08:49:59 -06:00
Zbigniew Jędrzejewski-Szmek	aa34055ffb	seccomp: allow specifying arm64, mips, ppc (#4491) "Secondary arch" table for mips is entirely speculative…	2016-11-01 09:33:18 -06:00
Lennart Poettering	a3be2849b2	seccomp: add new helper call seccomp_load_filter_set() This allows us to unify most of the code in apply_protect_kernel_modules() and apply_private_devices().	2016-10-24 17:32:50 +02:00
Lennart Poettering	60f547cf68	seccomp: two fixes for the syscall set tables "oldumount()" is not a syscall, but simply a wrapper for it, the actual syscall nr is called "umount" (and the nr of umount() is called umount2 internally). "sysctl()" is not a syscall, but "_sysctl()" is. Fix this in the table. Without these changes libseccomp cannot actually translate the tables in full. This wasn't noticed before as the code was written defensively for this case.	2016-10-24 17:32:50 +02:00
Lennart Poettering	8d7b0c8fd7	seccomp: add new seccomp_init_conservative() helper This adds a new seccomp_init_conservative() helper call that is mostly just a wrapper around seccomp_init(), but turns off NNP and adds in all secondary archs, for best compatibility with everything else. Pretty much all of our code used the very same constructs for these three steps, hence unifying this in one small function makes things a lot shorter. This also changes incorrect usage of the "scmp_filter_ctx" type at various places. libseccomp defines it as typedef to "void *", i.e. it is a pointer type (pretty poor choice already!) that casts implicitly to and from all other pointer types (even poorer choice: you defined a confusing type now, and don't even gain any bit of type safety through it...). A lot of the code assumed the type would refer to a structure, and hence added additional "*" here and there. Remove that.	2016-10-24 17:32:50 +02:00
Lennart Poettering	8130926d32	core: rework syscall filter set handling A variety of fixes: - rename the SystemCallFilterSet structure to SyscallFilterSet. So far the main instance of it (the syscall_filter_sets[] array) used to abbreviate "SystemCall" as "Syscall". Let's stick to one of the two syntaxes, and not mix and match too wildly. Let's pick the shorter name in this case, as it is sufficiently well established to not confuse hackers reading this. - Export explicit indexes into the syscall_filter_sets[] array via an enum. This way, code that wants to make use of a specific filter set, can index it directly via the enum, instead of having to search for it. This makes apply_private_devices() in particular a lot simpler. - Provide two new helper calls in seccomp-util.c: syscall_filter_set_find() to find a set by its name, seccomp_add_syscall_filter_set() to add a set to a seccomp object. - Update SystemCallFilter= parser to use extract_first_word(). Let's work on deprecating FOREACH_WORD_QUOTED(). - Simplify apply_private_devices() using this functionality	2016-10-24 17:32:50 +02:00
hbrueckner	6abfd30372	seccomp: add support for the s390 architecture (#4287) Add seccomp support for the s390 architecture (31-bit and 64-bit) to systemd. This requires libseccomp >= 2.3.1.	2016-10-05 13:58:55 +02:00

162 Commits