This new unit settings allows restricting which address families are
available to processes. This is an effective way to minimize the attack
surface of services, by turning off entire network stacks for them.
This is based on seccomp, and does not work on x86-32, since seccomp
cannot filter socketcall() syscalls on that platform.
According to Wikipedia it is customary to specify hardware metrics and
transfer speeds to the basis 1000 (SI decimal), while software metrics
and physical volatile memory (RAM) sizes to the basis 1024 (IEC binary).
So far we specified everything in IEC, let's fix that and be more
true to what's otherwise customary. Since we don't want to parse "Mi"
instead of "M" we document each time what the context used is.
This permit to switch to a specific apparmor profile when starting a daemon. This
will result in a non operation if apparmor is disabled.
It also add a new build requirement on libapparmor for using this feature.
This is useful to prohibit execution of non-native processes on systems,
for example 32bit binaries on 64bit systems, this lowering the attack
service on incorrect syscall and ioctl 32→64bit mappings.
- Allow configuration of an errno error to return from blacklisted
syscalls, instead of immediately terminating a process.
- Fix parsing logic when libseccomp support is turned off
- Only keep the actual syscall set in the ExecContext, and generate the
string version only on demand.
Previously we'd open the connection in the originating namespace, which
meant most peers of the bus would not be able to make sense of the
PID/UID/... identity of us since we didn't exist in the namespace they
run in. However they require this identity for privilege decisions,
hence disallowing access to anything from the host.
Instead, when connecting to a container, create a temporary subprocess,
make it join the container's namespace and then connect from there to
the kdbus instance. This is similar to how we do it for socket
conections already.
THis also unifies the namespacing code used by machinectl and the bus
APIs.
The only problem is that libgen.h #defines basename to point to it's
own broken implementation instead of the GNU one. This can be fixed
by #undefining basename.
Pass on the line on which a section was decleared to the parsers, so they
can distinguish between multiple sections (if they chose to). Currently
no parsers take advantage of this, but a follow-up patch will do that
to distinguish
[Address]
Address=192.168.0.1/24
Label=one
[Address]
Address=192.168.0.2/24
Label=two
from
[Address]
Address=192.168.0.1/24
Label=one
Address=192.168.0.2/24
Label=two
This patch converts PID 1 to libsystemd-bus and thus drops the
dependency on libdbus. The only remaining code using libdbus is a test
case that validates our bus marshalling against libdbus' marshalling,
and this dependency can be turned off.
This patch also adds a couple of things to libsystem-bus, that are
necessary to make the port work:
- Synthesizing of "Disconnected" messages when bus connections are
severed.
- Support for attaching multiple vtables for the same interface on the
same path.
This patch also fixes the SetDefaultTarget() and GetDefaultTarget() bus
calls which used an inappropriate signature.
As a side effect we will now generate PropertiesChanged messages which
carry property contents, rather than just invalidation information.
We now treat passno as boleans in the generators, and don't need this any more. fsck itself
is able to sequentialize checks on the same local media, so in the common case the ordering
is redundant.
It is still possible to force an order by using .d fragments, in case that is desired.
The code was actually safe, because b should
never be null, because if rvalue is empty, a different
branch is taken. But we *do* check for NULL in the
loop above, so it's better to also check here for symmetry.
Previously to automatically create dependencies between mount units we
matched every mount unit agains all others resulting in O(n^2)
complexity. On setups with large amounts of mount units this might make
things slow.
This change replaces the matching code to use a hashtable that is keyed
by a path prefix, and points to a set of units that require that path to
be around. When a new mount unit is installed it is hence sufficient to
simply look up this set of units via its own file system paths to know
which units to order after itself.
This patch also changes all unit types to only create automatic mount
dependencies via the RequiresMountsFor= logic, and this is exposed to
the outside to make things more transparent.
With this change we still have some O(n) complexities in place when
handling mounts, but that's currently unavoidable due to kernel APIs,
and still substantially better than O(n^2) as before.
https://bugs.freedesktop.org/show_bug.cgi?id=69740
The cgroup attribute memory.soft_limit_in_bytes is unlikely to stay
around in the kernel for good, so let's not expose it for now. We can
readd something like it later when the kernel guys decided on a final
API for this.
Previously the specifier calls could only indicate OOM by returning
NULL. With this change they will return negative errno-style error codes
like everything else.
BlockIOReadBandwidth and BlockIOWriteBandwidth both use
config_parse_blockio_bandwidth to set up CGroupBlockIODeviceBandwidth,
We should set the read value based on the left values
in config files.
Transient units can be created via the bus API. They are configured via
the method call parameters rather than on-disk files. They are subject
to normal GC. Transient units currently may only be created for
services (however, we will extend this), and currently only ExecStart=
and the cgroup parameters can be configured (also to be extended).
Transient units require a unique name, that previously had no
configuration file on disk.
A tool systemd-run is added that makes use of this functionality to run
arbitrary command lines as transient services:
$ systemd-run /bin/ping www.heise.de
Will cause systemd to create a new transient service and run ping in it.
Replace the very generic cgroup hookup with a much simpler one. With
this change only the high-level cgroup settings remain, the ability to
set arbitrary cgroup attributes is removed, so is support for adding
units to arbitrary cgroup controllers or setting arbitrary paths for
them (especially paths that are different for the various controllers).
This also introduces a new -.slice root slice, that is the parent of
system.slice and friends. This enables easy admin configuration of
root-level cgrouo properties.
This replaces DeviceDeny= by DevicePolicy=, and implicitly adds in
/dev/null, /dev/zero and friends if DeviceAllow= is used (unless this is
turned off by DevicePolicy=).
This complements existing functionality of setting variables
through 'systemctl set-environment', the kernel command line,
and through normal environment variables for systemd in session
mode.
In order to prepare for the kernel cgroup rework, let's introduce a new
unit type to systemd, the "slice". Slices can be arranged in a tree and
are useful to partition resources freely and hierarchally by the user.
Each service unit can now be assigned to one of these slices, and later
on login users and machines may too.
Slices translate pretty directly to the cgroup hierarchy, and the
various objects can be assigned to any of the slices in the tree.
Instead of having explicit type-specific callbacks that inform the
triggering unit when a triggered unit changes state, make this generic
so that state changes are forwarded betwee any triggered and triggering
unit.
Also, get rid of UnitRef references from automount, timer, path units,
to the units they trigger and rely exclsuively on UNIT_TRIGGER type
dendencies.
The information about the unit for which files are being parsed
is passed all the way down. This way messages land in the journal
with proper UNIT=... or USER_UNIT=... attribution.
'systemctl status' and 'journalctl -u' not displaying those messages
has been a source of confusion for users, since the journal entry for
a misspelt setting was often logged quite a bit earlier than the
failure to start a unit.
Based-on-a-patch-by: Oleksii Shevchuk <alxchk@gmail.com>
Internally we store all time values in usec_t, however parse_usec()
actually was used mostly to parse values in seconds (unless explicit
units were specified to define a different unit). Hence, be clear about
this and name the function about what we pass into it, not what we get
out of it.
Previously, it would set all caps, but it should drop them all, anything
else makes little sense.
Also, document that this works as it does, and what to do in order to
assign all caps to the bounding set.
https://bugzilla.redhat.com/show_bug.cgi?id=914705
This introduces a new static list of known attributes and their special
semantics. This means that cgroup attribute values can now be
automatically translated from user to kernel notation for command line
set settings, too.
This also adds proper support for multi-line attributes.
In the x32 ABI, syscall numbers start at 0x40000000. Mask that bit on
x32 for lookups in the syscall_names array and syscall_filter and ensure
that syscall.h is parsed correctly.
[zj: added SYSCALL_TO_INDEX, INDEX_TO_SYSCALL macros.]
strncmp() could be used with size bigger then the size of the string,
because MAX was used instead of MIN.
If failing, print just the offending mount flag.
Whenever a message fails, mention the offending word, instead
of just giving the whole line. If one bad word causes just this
word to be rejected, print only the word. If one bad word causes
the whole line to be rejected, print the whole line too.
https://bugs.freedesktop.org/show_bug.cgi?id=56874
A service that only sets the scheduling policy to round-robin
fails to be started. This is because the cpu_sched_priority is
initialized to 0 and is not adjusted when the policy is changed.
Clamp the cpu_sched_priority when the scheduler policy is set. Use
the current policy to validate the new priority.
Change the manual page to state that the given range only applies
to the real-time scheduling policies.
Add a testcase that verifies this change:
$ make test-sched-prio; ./test-sched-prio
[test/sched_idle_bad.service:6] CPU scheduling priority is out of range, ignoring: 1
[test/sched_rr_bad.service:7] CPU scheduling priority is out of range, ignoring: 0
[test/sched_rr_bad.service:8] CPU scheduling priority is out of range, ignoring: 100
The behaviour of the common name##_from_string conversion is surprising.
It accepts not only the strings from name##_table but also any number
that falls within the range of the table. The order of items in most of
our tables is an internal affair. It should not be visible to the user.
I know of a case where the surprising numeric conversion leads to a crash.
We will allow the direct numeric conversion only for the tables where the
mapping of strings to numeric values has an external meaning. This holds
for the following lookup tables:
- netlink_family, ioprio_class, ip_tos, sched_policy - their numeric
values are stable as they are defined by the Linux kernel interface.
- log_level, log_facility_unshifted - the well-known syslog interface.
We allow the user to use numeric values whose string names systemd does
not know. For instance, the user may want to test a new kernel featuring
a scheduling policy that did not exist when his systemd version was
released. A slightly unpleasant effect of this is that the
name##_to_string conversion cannot return pointers to constant strings
anymore. The strings have to be allocated on demand and freed by the
caller.
Add specifier expansion to Path and String conditions.
Specifier expansion for conditions will help create instance
and user session units by allowing us to template conditions
based on the instance or user session parameters.
An example would be a system-wide user session service file
that conditionally runs based on whether a user has the
service configured through a configuration file in ~/.config/.
There is no point in clearing the bits of a "struct stat" when the very
next statement just calls stat or fstat to fill in that same memory.
[zj: two more places]
Names= is a source of errors, simply because alias names specified like
this only become relevant after a unit has been loaded but cannot be
used to load a unit.
Let's get rid of the confusion and drop this field. To establish alias
names peope should use symlinks, which have the the benefit of being
useful as key to load a unit, even though they are not taken into
account if unit names are listed but they haven't been explicitly
referenced before.
This also ensures that caps dropped from the bounding set are also
dropped from the inheritable set, to be extra-secure. Usually that should
change very little though as the inheritable set is empty for all our uses
anyway.
UnitPath= is also writable via native units and may be used by generators
to clarify from which file a unit is generated. This patch also hooks up
the cryptsetup and fstab generators to set UnitPath= accordingly.
This should help making the boot process a bit easier to explore and
understand for the administrator. The simple idea is that "systemctl
status" now shows a link to documentation alongside the other status and
decriptionary information of a service.
This patch adds the necessary fields to all our shipped units if we have
proper documentation for them.
RequiresMountsFor= is a shortcut for adding requires and after
dependencies to all mount units neeed for the specified paths.
This solves a couple of issues regarding dep loop cycles for encrypted
swap.
We finally got the OK from all contributors with non-trivial commits to
relicense systemd from GPL2+ to LGPL2.1+.
Some udev bits continue to be GPL2+ for now, but we are looking into
relicensing them too, to allow free copy/paste of all code within
systemd.
The bits that used to be MIT continue to be MIT.
The big benefit of the relicensing is that closed source code may now
link against libsystemd-login.so and friends.