Commit graph

92 commits

Author SHA1 Message Date
Alban Crequy 287419c119 containers: systemd exits with non-zero code
When a systemd service running in a container exits with a non-zero
code, it can be useful to terminate the container immediately and get
the exit code back to the host, when systemd-nspawn returns. This was
not possible to do. This patch adds the following to make it possible:

- Add a read-only "ExitCode" property on PID 1's "Manager" bus object.
  By default, it is 0 so the behaviour stays the same as previously.
- Add a method "SetExitCode" on the same object. The method fails when
  called on baremetal: it is only allowed in containers or in user
  session.
- Add support in systemctl to call "systemctl exit 42". It reuses the
  existing code for user session.
- Add exit.target and systemd-exit.service to the system instance.
- Change main() to actually call systemd-shutdown to exit() with the
  correct value.
- Add verb 'exit' in systemd-shutdown with parameter --exit-code
- Update systemctl manpage.

I used the following to test it:

| $ sudo rkt --debug --insecure-skip-verify run \
|            --mds-register=false --local docker://busybox \
|            --exec=/bin/chroot -- /proc/1/root \
|            systemctl --force exit 42
| ...
| Container rkt-895a0cba-5c66-4fa5-831c-e3f8ddc5810d failed with error code 42.
| $ echo $?
| 42

Fixes https://github.com/systemd/systemd/issues/1290
2015-09-21 17:32:45 +02:00
Daniel Mack 32ee7d3309 cgroup: add support for net_cls controllers
Add a new config directive called NetClass= to CGroup enabled units.
Allowed values are positive numbers for fix assignments and "auto" for
picking a free value automatically, for which we need to keep track of
dynamically assigned net class IDs of units. Introduce a hash table for
this, and also record the last ID that was given out, so the allocator
can start its search for the next 'hole' from there. This could
eventually be optimized with something like an irb.

The class IDs up to 65536 are considered reserved and won't be
assigned automatically by systemd. This barrier can be made a config
directive in the future.

Values set in unit files are stored in the CGroupContext of the
unit and considered read-only. The actually assigned number (which
may have been chosen dynamically) is stored in the unit itself and
is guaranteed to remain stable as long as the unit is active.

In the CGroup controller, set the configured CGroup net class to
net_cls.classid. Multiple unit may share the same net class ID,
and those which do are linked together.
2015-09-16 00:21:55 +02:00
Lennart Poettering 5269eb6b32 core: allocate sets of startup and failed units on-demand
There's a good chance we never needs these sets, hence allocate them
only when needed.
2015-09-11 18:31:49 +02:00
Lennart Poettering 03a7b521e3 core: add support for the "pids" cgroup controller
This adds support for the new "pids" cgroup controller of 4.3 kernels.
It allows accounting the number of tasks in a cgroup and enforcing
limits on it.

This adds two new setting TasksAccounting= and TasksMax= to each unit,
as well as a gloabl option DefaultTasksAccounting=.

This also updated "cgtop" to optionally make use of the new
kernel-provided accounting.

systemctl has been updated to show the number of tasks for each service
if it is available.

This patch also adds correct support for undoing memory limits for units
using a MemoryLimit=infinity syntax. We do the same for TasksMax= now
and hence keep things in sync here.
2015-09-10 18:41:06 +02:00
Lennart Poettering efdb02375b core: unified cgroup hierarchy support
This patch set adds full support the new unified cgroup hierarchy logic
of modern kernels.

A new kernel command line option "systemd.unified_cgroup_hierarchy=1" is
added. If specified the unified hierarchy is mounted to /sys/fs/cgroup
instead of a tmpfs. No further hierarchies are mounted. The kernel
command line option defaults to off. We can turn it on by default as
soon as the kernel's APIs regarding this are stabilized (but even then
downstream distros might want to turn this off, as this will break any
tools that access cgroupfs directly).

It is possibly to choose for each boot individually whether the unified
or the legacy hierarchy is used. nspawn will by default provide the
legacy hierarchy to containers if the host is using it, and the unified
otherwise. However it is possible to run containers with the unified
hierarchy on a legacy host and vice versa, by setting the
$UNIFIED_CGROUP_HIERARCHY environment variable for nspawn to 1 or 0,
respectively.

The unified hierarchy provides reliable cgroup empty notifications for
the first time, via inotify. To make use of this we maintain one
manager-wide inotify fd, and each cgroup to it.

This patch also removes cg_delete() which is unused now.

On kernel 4.2 only the "memory" controller is compatible with the
unified hierarchy, hence that's the only controller systemd exposes when
booted in unified heirarchy mode.

This introduces a new enum for enumerating supported controllers, plus a
related enum for the mask bits mapping to it. The core is changed to
make use of this everywhere.

This moves PID 1 into a new "init.scope" implicit scope unit in the root
slice. This is necessary since on the unified hierarchy cgroups may
either contain subgroups or processes but not both. PID 1 hence has to
move out of the root cgroup (strictly speaking the root cgroup is the
only one where processes and subgroups are still allowed, but in order
to support containers nicey, we move PID 1 into the new scope in all
cases.) This new unit is also used on legacy hierarchy setups. It's
actually pretty useful on all systems, as it can then be used to filter
journal messages coming from PID 1, and so on.

The root slice ("-.slice") is now implicitly created and started (and
does not require a unit file on disk anymore), since
that's where "init.scope" is located and the slice needs to be started
before the scope can.

To check whether we are in unified or legacy hierarchy mode we use
statfs() on /sys/fs/cgroup. If the .f_type field reports tmpfs we are in
legacy mode, if it reports cgroupfs we are in unified mode.

This patch set carefuly makes sure that cgls and cgtop continue to work
as desired.

When invoking nspawn as a service it will implicitly create two
subcgroups in the cgroup it is using, one to move the nspawn process
into, the other to move the actual container processes into. This is
done because of the requirement that cgroups may either contain
processes or other subgroups.
2015-09-01 23:52:27 +02:00
Lennart Poettering ae2a2c53dd manager: don't write first-boot flag file all the time
Instead, remember that we have already written it.
2015-09-01 17:20:56 +02:00
Daniel Mack bbc2908635 core: dbus: track bus names per unit
Currently, PID1 installs an unfiltered NameOwnerChanged signal match, and
dispatches the signals itself. This does not scale, as right now, PID1
wakes up every time a bus client connects.

To fix this, install individual matches once they are requested by
unit_watch_bus_name(), and remove the watches again through their slot in
unit_unwatch_bus_name().

If the bus is not available during unit_watch_bus_name(), just store
name in the 'watch_bus' hashmap, and let bus_setup_api() do the installing
later.
2015-08-06 10:14:41 +02:00
Lennart Poettering b2c23da8ce core: rename SystemdRunningAs to ManagerRunningAs
It's primarily just a property of the Manager object after all, and we
try to refer to PID 1 as "manager" instead of "systemd", hence let's to
stick to this here too.
2015-05-11 22:51:49 +02:00
Lennart Poettering f2341e0a87 core,network: major per-object logging rework
This changes log_unit_info() (and friends) to take a real Unit* object
insted of just a unit name as parameter. The call will now prefix all
logged messages with the unit name, thus allowing the unit name to be
dropped from the various passed romat strings, simplifying invocations
drastically, and unifying log output across messages. Also, UNIT= vs.
USER_UNIT= is now derived from the Manager object attached to the Unit
object, instead of getpid(). This has the benefit of correcting the
field for --test runs.

Also contains a couple of other logging improvements:

- Drops a couple of strerror() invocations in favour of using %m.

- Not only .mount units now warn if a symlinks exist for the mount
  point already, .automount units do that too, now.

- A few invocations of log_struct() that didn't actually pass any
  additional structured data have been replaced by simpler invocations
  of log_unit_info() and friends.

- For structured data a new LOG_UNIT_MESSAGE() macro has been added,
  that works like LOG_MESSAGE() but prefixes the message with the unit
  name. Similar, there's now LOG_LINK_MESSAGE() and
  LOG_NETDEV_MESSAGE().

- For structured data new LOG_UNIT_ID(), LOG_LINK_INTERFACE(),
  LOG_NETDEV_INTERFACE() macros have been added that generate the
  necessary per object fields. The old log_unit_struct() call has been
  removed in favour of these new macros used in raw log_struct()
  invocations. In addition to removing one more function call this
  allows generated structured log messages that contain two object
  fields, as necessary for example for network interfaces that are
  joined into another network interface, and whose messages shall be
  indexed by both.

- The LOG_ERRNO() macro has been removed, in favour of
  log_struct_errno(). The latter has the benefit of ensuring that %m in
  format strings is properly resolved to the specified error number.

- A number of logging messages have been converted to use
  log_unit_info() instead of log_info()

- The client code in sysv-generator no longer #includes core code from
  src/core/.

- log_unit_full_errno() has been removed, log_unit_full() instead takes
  an errno now, too.

- log_unit_info(), log_link_info(), log_netdev_info() and friends, now
  avoid double evaluation of their parameters
2015-05-11 22:24:45 +02:00
Lennart Poettering 8f88ecf623 core: for queued reload message there is no need to store the bus explicitly
After all it can be derived from the message directly, and already is.
2015-04-29 19:02:08 +02:00
Lucas De Marchi 03455c2879 core: emit changes for NFailedUnits property
By notifying the clients when this property is changed it's possible to
allow "system health monitor" tools to get transitions like
running<->degraded. This is an alternative to send changes on the
SystemState property since the latter is more difficult to derive.
2015-02-26 09:38:50 -05:00
Thomas Hindoe Paaboel Andersen 2eec67acbb remove unused includes
This patch removes includes that are not used. The removals were found with
include-what-you-use which checks if any of the symbols from a header is
in use.
2015-02-23 23:53:42 +01:00
Lennart Poettering 2e5c94b9aa core: when the user hits Ctrl-Alt-Del more than 7x per 2s, reboot immediately
This should be useful for cases where clean rebooting doesn't work, and
the user wants to hurry up the reboot.
2015-01-28 02:18:59 +01:00
Zbigniew Jędrzejewski-Szmek e801700e9a Implement masking and overriding of generators
Sometimes it is necessary to stop a generator from running. Either
because of a bug, or for testing, or some other reason. The only way
to do that would be to rename or chmod the generator binary, which is
inconvenient and does not survive upgrades. Allow masking and
overriding generators similarly to units and other configuration
files.

For the systemd instance, masking would be more common, rather than
overriding generators. For the user instances, it may also be useful
for users to have generators in $XDG_CONFIG_HOME to augment or
override system-wide generators.

Directories are searched according to the usual scheme (/usr/lib,
/usr/local/lib, /run, /etc), and files with the same name in higher
priority directories override files with the same name in lower
priority directories. Empty files and links to /dev/null mask a given
name.

https://bugs.freedesktop.org/show_bug.cgi?id=87230
2015-01-11 18:17:33 -05:00
Chris Leech befb6d5494 mount: monitor for utab changes with inotify
Parsing the mount table with libmount races against the mount command,
which will handle the actual mounting before updating utab.  This means
the poll event on /proc/self/mountinfo can kick of a reparse in systemd
before the utab information is available.

This change adds in an additional event source using inotify to watch
for changes to utab.  It only watches for IN_MOVED_TO events, matching
libmount behavior of always overwriting this file using rename(2).

This does add a second pass through the mount table parsing when utab is
updated.
2014-11-28 14:30:50 -05:00
Zbigniew Jędrzejewski-Szmek 06d8d842e9 manager: let manager_free() handle NULLs
This makes the calling code a bit simpler.
2014-11-23 19:17:28 -05:00
Zbigniew Jędrzejewski-Szmek ebc5788e88 manager: print warning on console before reboot
It will be printed even if a prompt is blocking other messages.
2014-10-27 23:17:49 -04:00
Zbigniew Jędrzejewski-Szmek 127d5fd156 manager: convert ephemeral to enum
In preparation for subsequent changes.
2014-10-27 23:02:54 -04:00
Zbigniew Jędrzejewski-Szmek e46b13c8c7 manager: do not print anything while passwords are being queried
https://bugs.freedesktop.org/show_bug.cgi?id=73942
2014-10-27 22:33:14 -04:00
Lennart Poettering fa1b91632c core: remove system start timeout logic again
The system start timeout as previously implemented would get confused by
long-running services that are included in the initial system startup
transaction for example by being cron-job-like long-running services
triggered immediately at boot. Such long-running jobs would be subject
to the default 15min timeout, esily triggering it.

Hence, remove this again. In a subsequent commit, introduce per-target
job timeouts instead, that allow us to control these timeouts more
finegrained.
2014-10-28 01:42:13 +01:00
Lennart Poettering d81afec1c9 core: split up "starting" manager state into "initializing" and "starting"
We'll stay in "initializing" until basic.target has reached, at which
point we will enter "starting".

This is preparation so that we can change the startip timeout to only
apply to the first phase of startup, not the full procedure.
2014-08-22 18:10:31 +02:00
Lennart Poettering 2928b0a863 core: add support for a configurable system-wide start-up timeout
When this system-wide start-up timeout is hit we execute one of the
failure actions already implemented for services that fail.

This should not only be useful on embedded devices, but also on laptops
which have the power-button reachable when the lid is closed. This
devices, when in a backpack might get powered on by accident due to the
easily reachable power button. We want to make sure that the system
turns itself off if it starts up due this after a while.

When the system manages to fully start-up logind will suspend the
machine by default if the lid is closed. However, in some cases we don't
even get as far as logind, and the boot hangs much earlier, for example
because we ask for a LUKS password that nobody ever enters.

Yeah, this is a real-life problem on my Yoga 13, which has one of those
easily accessible power buttons, even if the device is closed.
2014-08-22 18:10:31 +02:00
Stef Walter 283868e1dc core: Verify systemd1 DBus method callers via polkit
DBus methods that retrieve information can be called by anyone.

DBus methods that modify state of units are verified via polkit
action: org.freedesktop.systemd1.manage-units

DBus methods that modify state of unit files are verified via polkit
action: org.freedesktop.systemd1.manage-unit-files

DBus methods that reload the entire daemon state are verified via polkit
action: org.freedesktop.systemd1.reload-daemon

DBus methods that modify job state are callable from the clients
that started the job.

root (ie: CAP_SYS_ADMIN) can continue to perform all calls, property
access etc. There are several DBus methods that can only be
called by root.

Open up the dbus1 policy for the above methods.

(Heavily modified by Lennart, making use of the new
bus_verify_polkit_async() version that doesn't force us to always
pass the original callback around. Also, interactive auhentication must
be opt-in, not unconditional, hence I turned this off.)
2014-08-18 18:08:28 +02:00
Zbigniew Jędrzejewski-Szmek 0d8c31ff72 test-engine: fix access to unit load path
Also add a bit of debugging output to help diagnose problems,
add missing units, and simplify cppflags.

Move test-engine to normal tests from manual tests, it should now
work without destroying the system.
2014-07-20 19:48:16 -04:00
Lennart Poettering e26807239b firstboot: get rid of firstboot generator again, introduce ConditionFirstBoot= instead
As Zbigniew pointed out a new ConditionFirstBoot= appears like the nicer
way to hook in systemd-firstboot.service on first boots (those with /etc
unpopulated), so let's do this, and get rid of the generator again.
2014-07-07 21:05:09 +02:00
Lennart Poettering 418b9be500 firstboot: add new component to query basic system settings on first boot, or when creating OS images offline
A new tool "systemd-firstboot" can be used either interactively on boot,
where it will query basic locale, timezone, hostname, root password
information and set it. Or it can be used non-interactively from the
command line when prepareing disk images for booting. When used
non-inertactively the tool can either copy settings from the host, or
take settings on the command line.

$ systemd-firstboot --root=/path/to/my/new/root --copy-locale --copy-root-password --hostname=waldi

The tool will be automatically invoked (interactively) now on first boot
if /etc is found unpopulated.

This also creates the infrastructure for generators to be notified via
an environment variable whether they are running on the first boot, or
not.
2014-07-07 15:25:55 +02:00
Lennart Poettering 9a05490933 cgroups: simplify CPUQuota= logic
Only accept cpu quota values in percentages, get rid of period
definition.

It's not clear whether the CFS period controllable per-cgroup even has a
future in the kernel, hence let's simplify all this, hardcode the period
to 100ms and only accept percentage based quota values.
2014-05-22 11:53:12 +09:00
WaLyong Cho 95ae05c0e7 core: add startup resource control option
Similar to CPUShares= and BlockIOWeight= respectively. However only
assign the specified weight during startup. Each control group
attribute is re-assigned as weight by CPUShares=weight and
BlockIOWeight=weight after startup.  If not CPUShares= or
BlockIOWeight= be specified, then the attribute is re-assigned to each
default attribute value. (default cpu.shares=1024, blkio.weight=1000)
If only CPUShares=weight or BlockIOWeight=weight be specified, then
that implies StartupCPUShares=weight and StartupBlockIOWeight=weight.
2014-05-22 07:13:56 +09:00
Lennart Poettering b2f8b02ec2 core: expose CFS CPU time quota as high-level unit properties 2014-04-25 13:27:25 +02:00
Lennart Poettering bd8f585b99 core: add a setting to globally control the default for timer unit accuracy 2014-03-24 16:24:07 +01:00
Lennart Poettering f755e3b74b core: introduce system state enum
The system state knows the states starting →
running/degraded/maintenance → stopping, where:

starting = system startup
running = normal operation
degraded = at least one unit is currently in failed state
maintenance = rescue/emergency mode is active or queued
stopping = system shutdown
2014-03-12 20:55:13 +01:00
Lennart Poettering 517d56b1d0 missing: if RLIMIT_RTTIME is not defined by the libc, then we need a new define for the max number of rlimits, too 2014-03-05 02:31:09 +01:00
Lennart Poettering 4d7213b274 core: move ShowStatus type into the core
Let's make the scope of the show-status stuff a bit smaller, and make it
private to the core, rather than shared API in shared/.
2014-03-03 21:23:12 +01:00
Lennart Poettering e66cf1a3f9 core: introduce new RuntimeDirectory= and RuntimeDirectoryMode= unit settings
As discussed on the ML these are useful to manage runtime directories
below /run for services.
2014-03-03 17:55:32 +01:00
Lennart Poettering 8f8f05a919 bus: add sd_bus_track object for tracking peers, and port core over to it
This is primarily useful for services that need to track clients which
reference certain objects they maintain, or which explicitly want to
subscribe to certain events. Something like this is done in a large
number of services, and not trivial to do. Hence, let's unify this at
one place.

This also ports over PID 1 to use this to ensure that subscriptions to
job and manager events are correctly tracked. As a side-effect this
makes sure we properly serialize and restore the track list across
daemon reexec/reload, which didn't work correctly before.

This also simplifies how we distribute messages to broadcast to the
direct busses: we only track subscriptions for the API bus and
implicitly assume that all direct busses are subscribed. This should be
a pretty OK simplification since clients connected via direct bus
connections are shortlived anyway.
2014-03-03 02:34:13 +01:00
Lennart Poettering 085afe36cb core: add global settings for enabling CPUAccounting=, MemoryAccounting=, BlockIOAccounting= for all units at once 2014-02-24 23:50:10 +01:00
Lennart Poettering 5ba6985b6c core: allow PIDs to be watched by two units at the same time
In some cases it is interesting to map a PID to two units at the same
time. For example, when a user logs in via a getty, which is reexeced to
/sbin/login that binary will be explicitly referenced as main pid of the
getty service, as well as implicitly referenced as part of the session
scope.
2014-02-07 15:14:36 +01:00
Zbigniew Jędrzejewski-Szmek cb8ccb2271 manager: also turn on output on unit failure 2014-01-27 23:17:03 -05:00
Zbigniew Jędrzejewski-Szmek d450b6f2a9 manager: add systemd.show_status=auto mode
When set to auto, status will shown when the first ephemeral message
is shown (a job has been running for five seconds). Then until the
boot or shutdown ends, status messages will be shown.

No indication about the switch is done: I think it should be clear
for the user that first the cylon eye and the ephemeral messages appear,
and afterwards messages are displayed.

The initial arming of the event source was still wrong, but now should
really be fixed.
2014-01-27 23:17:03 -05:00
Lennart Poettering e3dd987cfc core: allocate a kdbus bus for each systemd instance, if we can 2013-11-30 03:53:42 +01:00
Lennart Poettering 9670d583d3 swap: always track the current real device node of all swap devices, even when not active
This way, we can avoid executing two /bin/swapon jobs to be dispatched
for the same swap device if it is configured for two different paths.

Previously we were just tracking the device nodes of active swap
devices, which would not allow us to recognize the identity of two swap
devices before they are active.

https://bugs.freedesktop.org/show_bug.cgi?id=69835
2013-11-25 22:10:22 +01:00
Lennart Poettering 5bcb0f2ba0 swap: split state machine state ACTIVATING into two
We expect the event on /proc/swaps before we expect the SIGCHILD,
reflect this in the state machine.
2013-11-25 17:40:53 +01:00
Lennart Poettering 752b590500 core: dispatch run queue only if there's nothing else to do
Always read all external events before we decide what we do next.
2013-11-25 17:40:53 +01:00
Lennart Poettering 718db96199 core: convert PID 1 to libsystemd-bus
This patch converts PID 1 to libsystemd-bus and thus drops the
dependency on libdbus. The only remaining code using libdbus is a test
case that validates our bus marshalling against libdbus' marshalling,
and this dependency can be turned off.

This patch also adds a couple of things to libsystem-bus, that are
necessary to make the port work:

- Synthesizing of "Disconnected" messages when bus connections are
  severed.

- Support for attaching multiple vtables for the same interface on the
  same path.

This patch also fixes the SetDefaultTarget() and GetDefaultTarget() bus
calls which used an inappropriate signature.

As a side effect we will now generate PropertiesChanged messages which
carry property contents, rather than just invalidation information.
2013-11-20 20:52:36 +01:00
Thomas Hindoe Paaboel Andersen c2e0d600ed analyze: plot the time spent setting up security modules 2013-11-10 23:21:15 +01:00
Lennart Poettering 9588bc3209 Remove dead code and unexport some calls
"make check-api-unused" informs us about code that is not used anymore
or that is exported but only used internally. Fix these all over the
place.
2013-11-08 18:12:45 +01:00
Lukas Nykryn 3f41e1e595 manager: configurable StartLimit default values
https://bugzilla.redhat.com/show_bug.cgi?id=821723
2013-11-08 17:00:01 +01:00
Oleksii Shevchuk 1f19a534ea Configurable Timeouts/Restarts default values
https://bugs.freedesktop.org/show_bug.cgi?id=71132

Patch adds DefaultTimeoutStartSec, DefaultTimeoutStopSec, DefaultRestartSec
configuration options to manager configuration file.
2013-11-05 19:57:22 +01:00
Lennart Poettering 44b601bc79 macro: clean up usage of gcc attributes
Always use our own macros, and name all our own macros the same style.
2013-10-16 06:14:59 +02:00
Lennart Poettering a57f7e2c82 core: rework how we match mount units against each other
Previously to automatically create dependencies between mount units we
matched every mount unit agains all others resulting in O(n^2)
complexity. On setups with large amounts of mount units this might make
things slow.

This change replaces the matching code to use a hashtable that is keyed
by a path prefix, and points to a set of units that require that path to
be around. When a new mount unit is installed it is hence sufficient to
simply look up this set of units via its own file system paths to know
which units to order after itself.

This patch also changes all unit types to only create automatic mount
dependencies via the RequiresMountsFor= logic, and this is exposed to
the outside to make things more transparent.

With this change we still have some O(n) complexities in place when
handling mounts, but that's currently unavoidable due to kernel APIs,
and still substantially better than O(n^2) as before.

https://bugs.freedesktop.org/show_bug.cgi?id=69740
2013-09-26 20:20:30 +02:00