Apparently, util-linux' mount command implicitly drops the smack-related
options anyway before passing them to the kernel, if the kernel doesn't
know SMACK, hence there's no point in duplicating this in systemd.
Fixes#1696
This directive allows passing environment variables from the system
manager to spawned services. Variables in the system manager can be set
inside a container by passing `--set-env=...` options to systemd-spawn.
Tested with an on-disk test.service unit. Tested using multiple variable
names on a single line, with an empty setting to clear the current list
of variables, with non-existing variables.
Tested using `systemd-run -p PassEnvironment=VARNAME` to confirm it
works with transient units.
Confirmed that `systemctl show` will display the PassEnvironment
settings.
Checked that man pages are generated correctly.
No regressions in `make check`.
Snapshots were never useful or used for anything. Many systemd
developers that I spoke to at systemd.conf2015, didn't even know they
existed, so it is fairly safe to assume that this type can be deleted
without harm.
The fundamental problem with snapshots is that the state of the system
is dynamic, devices come and go, users log in and out, timers fire...
and restoring all units to some state from the past would "undo"
those changes, which isn't really possible.
Tested by creating a snapshot, running the new binary, and checking
that the transition did not cause errors, and the snapshot is gone,
and snapshots cannot be created anymore.
New systemctl says:
Unknown operation snapshot.
Old systemctl says:
Failed to create snapshot: Support for snapshots has been removed.
IgnoreOnSnaphost settings are warned about and ignored:
Support for option IgnoreOnSnapshot= has been removed and it is ignored
http://lists.freedesktop.org/archives/systemd-devel/2015-November/034872.html
Let's make sure "LimitCPU=30min" can be parsed properly, following the
usual logic how we parse time values. Similar for LimitRTTIME=.
While we are at it, extend a bit on the man page section about resource
limits.
Fixes: #1772
Let's make things more user-friendly and support for example
LimitAS=16G
rather than force users to always use LimitAS=16106127360.
The change is relevant for options:
[Default]Limit{FSIZE,DATA,STACK,CORE,RSS,AS,MEMLOCK,MSGQUEUE}
The patch introduces config_parse_bytes_limit(), it's the same as
config_parse_limit() but uses parse_size() tu support the suffixes.
Addresses: https://github.com/systemd/systemd/issues/1772
This option specifies the label to assign the root of the file system if
it lacks the Smack extended attribute. Note that this option will be
ignored if kernel does not support the Smack feature by runtime
checking.
This adds support for naming file descriptors passed using socket
activation. The names are passed in a new $LISTEN_FDNAMES= environment
variable, that matches the existign $LISTEN_FDS= one and contains a
colon-separated list of names.
This also adds support for naming fds submitted to the per-service fd
store using FDNAME= in the sd_notify() message.
This also adds a new FileDescriptorName= setting for socket unit files
to set the name for fds created by socket units.
This also adds a new call sd_listen_fds_with_names(), that is similar to
sd_listen_fds(), but also returns the names of the fds.
systemd-activate gained the new --fdname= switch to specify a name for
testing socket activation.
This is based on #1247 by Maciej Wereski.
Fixes#1247.
Writable= is a new boolean setting. If ture, then ListenSpecial= will
open the specified path in O_RDWR mode, rather than just O_RDONLY.
This is useful for implementing services like rfkill, where /dev/rfkill
is more useful when opened in write mode, if we want to not only save
but also restore its state.
If set to ~ the working directory is set to the home directory of the
user configured in User=.
This change also exposes the existing switch for the working directory
that allowed making missing working directories non-fatal.
This also changes "machinectl shell" to make use of this to ensure that
the invoked shell is by default in the user's home directory.
Fixes#1268.
Add a new config directive called NetClass= to CGroup enabled units.
Allowed values are positive numbers for fix assignments and "auto" for
picking a free value automatically, for which we need to keep track of
dynamically assigned net class IDs of units. Introduce a hash table for
this, and also record the last ID that was given out, so the allocator
can start its search for the next 'hole' from there. This could
eventually be optimized with something like an irb.
The class IDs up to 65536 are considered reserved and won't be
assigned automatically by systemd. This barrier can be made a config
directive in the future.
Values set in unit files are stored in the CGroupContext of the
unit and considered read-only. The actually assigned number (which
may have been chosen dynamically) is stored in the unit itself and
is guaranteed to remain stable as long as the unit is active.
In the CGroup controller, set the configured CGroup net class to
net_cls.classid. Multiple unit may share the same net class ID,
and those which do are linked together.
This adds support for the new "pids" cgroup controller of 4.3 kernels.
It allows accounting the number of tasks in a cgroup and enforcing
limits on it.
This adds two new setting TasksAccounting= and TasksMax= to each unit,
as well as a gloabl option DefaultTasksAccounting=.
This also updated "cgtop" to optionally make use of the new
kernel-provided accounting.
systemctl has been updated to show the number of tasks for each service
if it is available.
This patch also adds correct support for undoing memory limits for units
using a MemoryLimit=infinity syntax. We do the same for TasksMax= now
and hence keep things in sync here.
.nspawn fiels are simple settings files that may accompany container
images and directories and contain settings otherwise passed on the
nspawn command line. This provides an efficient way to attach execution
data directly to containers.
When generating utmp/wtmp entries, optionally add both LOGIN_PROCESS and
INIT_PROCESS entries or even all three of LOGIN_PROCESS, INIT_PROCESS
and USER_PROCESS entries, instead of just a single INIT_PROCESS entry.
With this change systemd may be used to not only invoke a getty directly
in a SysV-compliant way but alternatively also a login(1) implementation
or even forego getty and login entirely, and invoke arbitrary shells in
a way that they appear in who(1) or w(1).
This is preparation for a later commit that adds a "machinectl shell"
operation to invoke a shell in a container, in a way that is compatible
with who(1) and w(1).
./configure --enable/disable-kdbus can be used to set the default
behavior regarding kdbus.
If no kdbus kernel support is available, dbus-dameon will be used.
With --enable-kdbus, the kernel command line option "kdbus=0" can
be used to disable kdbus.
With --disable-kdbus, the kernel command line option "kdbus=1" is
required to enable kdbus support.
With this change it is possible to send file descriptors to PID 1, via
sd_pid_notify_with_fds() which PID 1 will store individually for each
service, and pass via the usual fd passing logic on next invocation.
This is useful for enable daemon reload schemes where daemons serialize
their state to /run, push their fds into PID 1 and terminate, restoring
their state on next start from the data in /run and passed in from PID
1.
The fds are kept by PID 1 as long as no POLLHUP or POLLERR is seen on
them, and the service they belong to are either not dead or failed, or
have a job queued.
In service file, if the file has some of special SMACK label in
ExecStart= and systemd has no permission for the special SMACK label
then permission error will occurred. To resolve this, systemd should
be able to set its SMACK label to something accessible of ExecStart=.
So introduce new SmackProcessLabel. If label is specified with
SmackProcessLabel= then the child systemd will set its label to
that. To successfully execute the ExecStart=, accessible label should
be specified with SmackProcessLabel=.
Additionally, by SMACK policy, if the file in ExecStart= has no
SMACK64EXEC then the executed process will have given label by
SmackProcessLabel=. But if the file has SMACK64EXEC then the
SMACK64EXEC label will be overridden.
[zj: reword man page]
For priviliged units this resource control property ensures that the
processes have all controllers systemd manages enabled.
For unpriviliged services (those with User= set) this ensures that
access rights to the service cgroup is granted to the user in question,
to create further subgroups. Note that this only applies to the
name=systemd hierarchy though, as access to other controllers is not
safe for unpriviliged processes.
Delegate=yes should be set for container scopes where a systemd instance
inside the container shall manage the hierarchies below its own cgroup
and have access to all controllers.
Delegate=yes should also be set for user@.service, so that systemd
--user can run, controlling its own cgroup tree.
This commit changes machined, systemd-nspawn@.service and user@.service
to set this boolean, in order to ensure that container management will
just work, and the user systemd instance can run fine.
For now, it's systemd itself that parses the options string, but as soon
as util-linux' swapon can take the option string directly with -o we
should pass it on unmodified.
This makes possible to spawn service instances triggered by socket with
MLS/MCS SELinux labels which are created based on information provided by
connected peer.
Implementation of label_get_child_mls_label derived from xinetd.
Reviewed-by: Paul Moore <pmoore@redhat.com>
Add a new directive called BusPolicy to define custom endpoint policies. If
one such directive is given, an endpoint object in the service's ExecContext is
created and the given policy is added to it.
This is what we have done so far for all other time values, and hence we
should do this here. This indicates the default unit of time values
specified here, if they don't contain a unit.
This makes possible to spawn service instances triggered by socket with
MLS/MCS SELinux labels which are created based on information provided by
connected peer.
Implementation of label_get_child_label derived from xinetd.
Reviewed-by: Paul Moore <pmoore@redhat.com>
TCP_DEFER_ACCEPT Allow a listener to be awakened only when data
arrives on the socket. If TCP_DEFER_ACCEPT set on a server-side
listening socket, the TCP/IP stack will not to wait for the final
ACK packet and not to initiate the process until the first packet
of real data has arrived. After sending the SYN/ACK, the server will
then wait for a data packet from a client. Now, only three packets
will be sent over the network, and the connection establishment delay
will be significantly reduced.
The tcp keep alive variables now can be configured via conf
parameter. Follwing variables are now supported by this patch.
tcp_keepalive_intvl: The number of seconds between TCP keep-alive probes
tcp_keepalive_probes: The maximum number of TCP keep-alive probes to
send before giving up and killing the connection if no response is
obtained from the other end.
tcp_keepalive_time: The number of seconds a connection needs to be
idle before TCP begins sending out keep-alive probes.
This reverts commit 9528592ff8.
Apparently TFO is actually the default at least for the server side now.
Also the setsockopt doesn't actually take a bool, but a qlen integer.
TCP Fast Open (TFO) speeds up the opening of successiveTCP)
connections between two endpoints.It works by using a TFO cookie
in the initial SYN packet to authenticate a previously connected
client. It starts sending data to the client before the receipt
of the final ACK packet of the three way handshake is received,
skipping a round trip and lowering the latency in the start of
transmission of data.
This patch adds support for TCP TCP_NODELAY socket option. This can be
configured via NoDelay conf parameter. TCP Nagle's algorithm works by
combining a number of small outgoing messages, and sending them all at
once. This controls the TCP_NODELAY socket option.
As Zbigniew pointed out a new ConditionFirstBoot= appears like the nicer
way to hook in systemd-firstboot.service on first boots (those with /etc
unpopulated), so let's do this, and get rid of the generator again.
The DefaultInstance= name is used when enabling template units when only
specifying the template name, but no instance.
Add DefaultInstance=tty1 to getty@.service, so that when the template
itself is enabled an instance for tty1 is created.
This is useful so that we "systemctl preset-all" can work properly,
because we can operate on getty@.service after finding it, and the right
instance is created.
This new condition allows checking whether /etc or /var are out-of-date
relative to /usr. This is the counterpart for the update flag managed by
systemd-update-done.service. Services that want to be started once after
/usr got updated should use:
[Unit]
ConditionNeedsUpdate=/etc
Before=systemd-update-done.service
This makes sure that they are only run if /etc is out-of-date relative
to /usr. And that it will be executed after systemd-update-done.service
which is responsible for marking /etc up-to-date relative to the current
/usr.
ConditionNeedsUpdate= will also checks whether /etc is actually
writable, and not trigger if it isn't, since no update is possible then.
Also, rename ProtectedHome= to ProtectHome=, to simplify things a bit.
With this in place we now have two neat options ProtectSystem= and
ProtectHome= for protecting the OS itself (and optionally its
configuration), and for protecting the user's data.
With Symlinks= we can manage one or more symlinks to AF_UNIX or FIFO
nodes in the file system, with the same lifecycle as the socket itself.
This has two benefits: first, this allows us to remove /dev/log and
/dev/initctl from /dev, thus leaving only symlinks, device nodes and
directories in the /dev tree. More importantly however, this allows us
to move /dev/log out of /dev, while still making it accessible there, so
that PrivateDevices= can provide /dev/log too.
ReadOnlySystem= uses fs namespaces to mount /usr and /boot read-only for
a service.
ProtectedHome= uses fs namespaces to mount /home and /run/user
inaccessible or read-only for a service.
This patch also enables these settings for all our long-running services.
Together they should be good building block for a minimal service
sandbox, removing the ability for services to modify the operating
system or access the user's private data.
Only accept cpu quota values in percentages, get rid of period
definition.
It's not clear whether the CFS period controllable per-cgroup even has a
future in the kernel, hence let's simplify all this, hardcode the period
to 100ms and only accept percentage based quota values.
Similar to CPUShares= and BlockIOWeight= respectively. However only
assign the specified weight during startup. Each control group
attribute is re-assigned as weight by CPUShares=weight and
BlockIOWeight=weight after startup. If not CPUShares= or
BlockIOWeight= be specified, then the attribute is re-assigned to each
default attribute value. (default cpu.shares=1024, blkio.weight=1000)
If only CPUShares=weight or BlockIOWeight=weight be specified, then
that implies StartupCPUShares=weight and StartupBlockIOWeight=weight.
When rebooting with systemctl, an optional argument can be passed to the
reboot system call. This makes it possible the specify the argument in a
service file and use it when the service triggers a restart.
This is useful to distinguish between manual reboots and reboots caused by
failing services.
tcpwrap is legacy code, that is barely maintained upstream. It's APIs
are awful, and the feature set it exposes (such as DNS and IDENT
access control) questionnable. We should not support this natively in
systemd.
Hence, let's remove the code. If people want to continue making use of
this, they can do so by plugging in "tcpd" for the processes they start.
With that scheme things are as well or badly supported as they were from
traditional inetd, hence no functionality is really lost.
Add a new config 'Activating' directive which denotes whether a busname
is actually registered on the bus. It defaults to 'yes'.
If set to 'no', the .busname unit only uploads policy, which will remain
active as long as the unit is running.
AcceptFD= defaults to true, thus making sure that by default fd passing
is enabled for all activatable names. Since for normal bus connections
fd passing is enabled too by default this makes sure fd passing works
correctly regardless whether a service is already activated or not.
Making this configurable on both busname units and in bus connections is
messy, but unavoidable since busnames are established and may queue
messages before the connection feature negotiation is done by the
service eventually activated. Conversely, feature negotiation on bus
connections takes place before the connection acquires its names.
Of course, this means developers really should make sure to keep the
settings in .busname units in sync with what they later intend to
negotiate.
There are three directives to specify bus name polices in .busname
files:
* AllowUser [username] [access]
* AllowGroup [groupname] [access]
* AllowWorld [access]
Where [access] is one of
* 'see': The user/group/world is allowed to see a name on the bus
* 'talk': The user/group/world is allowed to talk to a name
* 'own': The user/group/world is allowed to own a name
There is no user added yet in this commit.
This new unit settings allows restricting which address families are
available to processes. This is an effective way to minimize the attack
surface of services, by turning off entire network stacks for them.
This is based on seccomp, and does not work on x86-32, since seccomp
cannot filter socketcall() syscalls on that platform.
According to Wikipedia it is customary to specify hardware metrics and
transfer speeds to the basis 1000 (SI decimal), while software metrics
and physical volatile memory (RAM) sizes to the basis 1024 (IEC binary).
So far we specified everything in IEC, let's fix that and be more
true to what's otherwise customary. Since we don't want to parse "Mi"
instead of "M" we document each time what the context used is.
This permit to switch to a specific apparmor profile when starting a daemon. This
will result in a non operation if apparmor is disabled.
It also add a new build requirement on libapparmor for using this feature.
If we put a closing bracket on its own line, gperf will complain about
empty lines. Only occurs if the option in question is disabled. So fix the
m4 macros to work properly in both cases.
This is useful to prohibit execution of non-native processes on systems,
for example 32bit binaries on 64bit systems, this lowering the attack
service on incorrect syscall and ioctl 32→64bit mappings.
- Allow configuration of an errno error to return from blacklisted
syscalls, instead of immediately terminating a process.
- Fix parsing logic when libseccomp support is turned off
- Only keep the actual syscall set in the ExecContext, and generate the
string version only on demand.
This permit to let system administrators decide of the domain of a service.
This can be used with templated units to have each service in a différent
domain ( for example, a per customer database, using MLS or anything ),
or can be used to force a non selinux enabled system (jvm, erlang, etc)
to start in a different domain for each service.
Similar to PrivateNetwork=, PrivateTmp= introduce PrivateDevices= that
sets up a private /dev with only the API pseudo-devices like /dev/null,
/dev/zero, /dev/random, but not any physical devices in them.
This patch converts PID 1 to libsystemd-bus and thus drops the
dependency on libdbus. The only remaining code using libdbus is a test
case that validates our bus marshalling against libdbus' marshalling,
and this dependency can be turned off.
This patch also adds a couple of things to libsystem-bus, that are
necessary to make the port work:
- Synthesizing of "Disconnected" messages when bus connections are
severed.
- Support for attaching multiple vtables for the same interface on the
same path.
This patch also fixes the SetDefaultTarget() and GetDefaultTarget() bus
calls which used an inappropriate signature.
As a side effect we will now generate PropertiesChanged messages which
carry property contents, rather than just invalidation information.
We now treat passno as boleans in the generators, and don't need this any more. fsck itself
is able to sequentialize checks on the same local media, so in the common case the ordering
is redundant.
It is still possible to force an order by using .d fragments, in case that is desired.
Previously to automatically create dependencies between mount units we
matched every mount unit agains all others resulting in O(n^2)
complexity. On setups with large amounts of mount units this might make
things slow.
This change replaces the matching code to use a hashtable that is keyed
by a path prefix, and points to a set of units that require that path to
be around. When a new mount unit is installed it is hence sufficient to
simply look up this set of units via its own file system paths to know
which units to order after itself.
This patch also changes all unit types to only create automatic mount
dependencies via the RequiresMountsFor= logic, and this is exposed to
the outside to make things more transparent.
With this change we still have some O(n) complexities in place when
handling mounts, but that's currently unavoidable due to kernel APIs,
and still substantially better than O(n^2) as before.
https://bugs.freedesktop.org/show_bug.cgi?id=69740
The cgroup attribute memory.soft_limit_in_bytes is unlikely to stay
around in the kernel for good, so let's not expose it for now. We can
readd something like it later when the kernel guys decided on a final
API for this.
Replace the very generic cgroup hookup with a much simpler one. With
this change only the high-level cgroup settings remain, the ability to
set arbitrary cgroup attributes is removed, so is support for adding
units to arbitrary cgroup controllers or setting arbitrary paths for
them (especially paths that are different for the various controllers).
This also introduces a new -.slice root slice, that is the parent of
system.slice and friends. This enables easy admin configuration of
root-level cgrouo properties.
This replaces DeviceDeny= by DevicePolicy=, and implicitly adds in
/dev/null, /dev/zero and friends if DeviceAllow= is used (unless this is
turned off by DevicePolicy=).
In order to prepare for the kernel cgroup rework, let's introduce a new
unit type to systemd, the "slice". Slices can be arranged in a tree and
are useful to partition resources freely and hierarchally by the user.
Each service unit can now be assigned to one of these slices, and later
on login users and machines may too.
Slices translate pretty directly to the cgroup hierarchy, and the
various objects can be assigned to any of the slices in the tree.
Instead of having explicit type-specific callbacks that inform the
triggering unit when a triggered unit changes state, make this generic
so that state changes are forwarded betwee any triggered and triggering
unit.
Also, get rid of UnitRef references from automount, timer, path units,
to the units they trigger and rely exclsuively on UNIT_TRIGGER type
dendencies.
Internally we store all time values in usec_t, however parse_usec()
actually was used mostly to parse values in seconds (unless explicit
units were specified to define a different unit). Hence, be clear about
this and name the function about what we pass into it, not what we get
out of it.
This introduces a new static list of known attributes and their special
semantics. This means that cgroup attribute values can now be
automatically translated from user to kernel notation for command line
set settings, too.
This also adds proper support for multi-line attributes.
Since we already allow defining the mode of AF_UNIX sockets and FIFO, it
makes sense to also allow specific user/group ownership of the socket
file for restricting access.
This adds SMACK label configuration options to socket units.
SMACK labels should be applied to most objects on disk well before
execution time, but two items remain that are generated dynamically
at run time that require SMACK labels to be set in order to enforce
MAC on all objects.
Files on disk can be labelled using package management.
For device nodes, simple udev rules are sufficient to add SMACK labels
at boot/insertion time.
Sockets can be created at run time and systemd does just that for
several services. In order to protect FIFO's and UNIX domain sockets,
we must instruct systemd to apply SMACK labels at runtime.
This patch adds the following options:
Smack - applicable to FIFO's.
SmackIpIn/SmackIpOut - applicable to sockets.
No external dependencies are required to support SMACK, as setting
the labels is done using fsetxattr(). The labels can be set on a
kernel that does not have SMACK enabled either, so there is no need
to #ifdef any of this code out.
For more information about SMACK, please see Documentation/Smack.txt
in the kernel source code.
v3 of this patch changes the config options to be CamelCased.
In some cases, like wrong configuration, restarting after error
does not help, so administrator can specify statuses by RestartPreventExitStatus
which will not cause restart of a service.
Sometimes you have non-standart exit status, so this can be specified
by SuccessfulExitStatus.
This should address TODO item "new dependency type to "group" services
in a target". Semantic of new dependency is as follows. Once configured
it creates dependency which will cause that all dependent units get
stopped if unit they all depend on is stopped or restarted. Usual use
case would be configuring PartOf=some.target in template unit file
and WantedBy=some.target in [Install] section and enabling desired
number of instances. In this case starting one instance won't pull in
target but stopping or starting target(in case of WantedBy is properly
configured) will cause stop/start of all instances.
all other dependencies are in 3rd person. Change BindTo= accordingly to
BindsTo=.
Of course, the dependency is widely used, hence we parse the old name
too for compatibility.
Names= is a source of errors, simply because alias names specified like
this only become relevant after a unit has been loaded but cannot be
used to load a unit.
Let's get rid of the confusion and drop this field. To establish alias
names peope should use symlinks, which have the the benefit of being
useful as key to load a unit, even though they are not taken into
account if unit names are listed but they haven't been explicitly
referenced before.
This also ensures that caps dropped from the bounding set are also
dropped from the inheritable set, to be extra-secure. Usually that should
change very little though as the inheritable set is empty for all our uses
anyway.
UnitPath= is also writable via native units and may be used by generators
to clarify from which file a unit is generated. This patch also hooks up
the cryptsetup and fstab generators to set UnitPath= accordingly.
This should help making the boot process a bit easier to explore and
understand for the administrator. The simple idea is that "systemctl
status" now shows a link to documentation alongside the other status and
decriptionary information of a service.
This patch adds the necessary fields to all our shipped units if we have
proper documentation for them.
RequiresMountsFor= is a shortcut for adding requires and after
dependencies to all mount units neeed for the specified paths.
This solves a couple of issues regarding dep loop cycles for encrypted
swap.