Commit graph

265 commits

Author SHA1 Message Date
Zbigniew Jędrzejewski-Szmek 74ad38ff0e Merge pull request #3160 from htejun/cgroup-fixes-rev2
Cgroup fixes.
2016-05-07 15:08:57 -04:00
Tejun Heo 13c31542cc core: add io controller support on the unified hierarchy
On the unified hierarchy, blkio controller is renamed to io and the interface
is changed significantly.

* blkio.weight and blkio.weight_device are consolidated into io.weight which
  uses the standardized weight range [1, 10000] with 100 as the default value.

* blkio.throttle.{read|write}_{bps|iops}_device are consolidated into io.max.
  Expansion of throttling features is being worked on to support
  work-conserving absolute limits (io.low and io.high).

* All stats are consolidated into io.stats.

This patchset adds support for the new interface.  As the interface has been
revamped and new features are expected to be added, it seems best to treat it
as a separate controller rather than trying to expand the blkio settings
although we might add automatic translation if only blkio settings are
specified.

* io.weight handling is mostly identical to blkio.weight[_device] handling
  except that the weight range is different.

* Both read and write bandwidth settings are consolidated into
  CGroupIODeviceLimit which describes all limits applicable to the device.
  This makes it less painful to add new limits.

* "max" can be used to specify the maximum limit which is equivalent to no
  config for max limits and treated as such.  If a given CGroupIODeviceLimit
  doesn't contain any non-default configs, the config struct is discarded once
  the no limit config is applied to cgroup.

* lookup_blkio_device() is renamed to lookup_block_device().

Signed-off-by: Tejun Heo <htejun@fb.com>
2016-05-05 16:43:06 -04:00
Lennart Poettering d8fdc62037 core: use an AF_UNIX/SOCK_DGRAM socket for cgroup agent notification
dbus-daemon currently uses a backlog of 30 on its D-bus system bus socket. On
overloaded systems this means that only 30 connections may be queued without
dbus-daemon processing them before further connection attempts fail. Our
cgroups-agent binary so far used D-Bus for its messaging, and hitting this
limit hence may result in us losing cgroup empty messages.

This patch adds a seperate cgroup agent socket of type AF_UNIX/SOCK_DGRAM.
Since sockets of these types need no connection set up, no listen() backlog
applies. Our cgroup-agent binary will hence simply block as long as it can't
enqueue its datagram message, so that we won't lose cgroup empty messages as
likely anymore.

This also rearranges the ordering of the processing of SIGCHLD signals, service
notification messages (sd_notify()...) and the two types of cgroup
notifications (inotify for the unified hierarchy support, and agent for the
classic hierarchy support). We now always process events for these in the
following order:

  1. service notification messages  (SD_EVENT_PRIORITY_NORMAL-7)
  2. SIGCHLD signals (SD_EVENT_PRIORITY_NORMAL-6)
  3. cgroup inotify and cgroup agent (SD_EVENT_PRIORITY_NORMAL-5)

This is because when receiving SIGCHLD we invalidate PID information, which we
need to process the service notification messages which are bound to PIDs.
Hence the order between the first two items. And we want to process SIGCHLD
metadata to detect whether a service is gone, before using cgroup
notifications, to decide when a service is gone, since the former carries more
useful metadata.

Related to this:
https://bugs.freedesktop.org/show_bug.cgi?id=95264
https://github.com/systemd/systemd/issues/1961
2016-05-05 12:37:04 +02:00
Tejun Heo ccf78df1fc core: make unit_has_mask_realized() consider controller enable state
unit_has_mask_realized() determines whether the specified unit has its cgroups
set up properly given the desired target_mask; however, on the unified
hierarchy, controllers need to be enabled explicitly for children and the mask
of enabled controllers can deviate from target_mask.  Only considering
target_mask in unit_has_mask_realized() can lead to false positives and
skipping enabling the requested controllers.

This patch adds unit->cgroup_enabled_mask to track which controllers are
enabled and updates unit_has_mask_realized() to also consider enable_mask.

Signed-off-by: Tejun Heo <htejun@fb.com>
2016-04-30 16:12:54 -04:00
Lennart Poettering 463d0d1569 core: remove ManagerRunningAs enum
Previously, we had two enums ManagerRunningAs and UnitFileScope, that were
mostly identical and converted from one to the other all the time. The latter
had one more value UNIT_FILE_GLOBAL however.

Let's simplify things, and remove ManagerRunningAs and replace it by
UnitFileScope everywhere, thus making the translation unnecessary. Introduce
two new macros MANAGER_IS_SYSTEM() and MANAGER_IS_USER() to simplify checking
if we are running in one or the user context.
2016-04-12 13:43:30 +02:00
Tejun Heo ab2c3861dc core: update populated event handling in unified hierarchy
Earlier during the development of unified hierarchy, the populated event was
reported through by the dedicated "cgroup.populated" file; however, the
interface was updated so that it's reported through the "populated" field of
"cgroup.events" file.  Update populated event handling logic accordingly.
2016-03-26 12:05:57 -04:00
Daniel Mack 50f48ad37a cgroup: remove support for NetClass= directive
Support for net_cls.class_id through the NetClass= configuration directive
has been added in v227 in preparation for a per-unit packet filter mechanism.
However, it turns out the kernel people have decided to deprecate the net_cls
and net_prio controllers in v2. Tejun provides a comprehensive justification
for this in his commit, which has landed during the merge window for kernel
v4.5:

  https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=bd1060a1d671

As we're aiming for full support for the v2 cgroup hierarchy, we can no
longer support this feature. Userspace tool such as nftables are moving over
to setting rules that are specific to the full cgroup path of a task, which
obsoletes these controllers anyway.

This commit removes support for tweaking details in the net_cls controller,
but keeps the NetClass= directive around for legacy compatibility reasons.
2016-02-10 16:38:56 +01:00
Daniel Mack b26fa1a2fb tree-wide: remove Emacs lines from all files
This should be handled fine now by .dir-locals.el, so need to carry that
stuff in every file.
2016-02-10 13:41:57 +01:00
Lennart Poettering 077ba06eaa core: don't generate warnings when write access to the cgroup fs fails in --user due to EACCES
After all, in the classic hierarchy that's pretty much the default case.
2015-11-17 00:52:10 +01:00
Lennart Poettering 7760171904 util-lib: move inotify-related definitions to fs-util.[ch] 2015-10-27 14:58:05 +01:00
Lennart Poettering 6bc73acb01 process-util: rename get_parent_of_pid() → get_process_ppid()
In order to match the other get_process_xyz() calls.
2015-10-27 14:01:48 +01:00
Lennart Poettering b5efdb8af4 util-lib: split out allocation calls into alloc-util.[ch] 2015-10-27 13:45:53 +01:00
Lennart Poettering 8b43440b7e util-lib: move string table stuff into its own string-table.[ch] 2015-10-27 13:25:56 +01:00
Lennart Poettering 0d39fa9c69 util-lib: move more file I/O related calls into fileio.[ch] 2015-10-27 13:25:55 +01:00
Lennart Poettering 6bedfcbb29 util-lib: split string parsing related calls from util.[ch] into parse-util.[ch] 2015-10-27 13:25:55 +01:00
Lennart Poettering 3ffd4af220 util-lib: split out fd-related operations into fd-util.[ch]
There are more than enough to deserve their own .c file, hence move them
over.
2015-10-25 13:19:18 +01:00
Lennart Poettering 07630cea1f util-lib: split our string related calls from util.[ch] into its own file string-util.[ch]
There are more than enough calls doing string manipulations to deserve
its own files, hence do something about it.

This patch also sorts the #include blocks of all files that needed to be
updated, according to the sorting suggestions from CODING_STYLE. Since
pretty much every file needs our string manipulation functions this
effectively means that most files have sorted #include blocks now.

Also touches a few unrelated include files.
2015-10-24 23:05:02 +02:00
Daniel Mack 32ee7d3309 cgroup: add support for net_cls controllers
Add a new config directive called NetClass= to CGroup enabled units.
Allowed values are positive numbers for fix assignments and "auto" for
picking a free value automatically, for which we need to keep track of
dynamically assigned net class IDs of units. Introduce a hash table for
this, and also record the last ID that was given out, so the allocator
can start its search for the next 'hole' from there. This could
eventually be optimized with something like an irb.

The class IDs up to 65536 are considered reserved and won't be
assigned automatically by systemd. This barrier can be made a config
directive in the future.

Values set in unit files are stored in the CGroupContext of the
unit and considered read-only. The actually assigned number (which
may have been chosen dynamically) is stored in the unit itself and
is guaranteed to remain stable as long as the unit is active.

In the CGroup controller, set the configured CGroup net class to
net_cls.classid. Multiple unit may share the same net class ID,
and those which do are linked together.
2015-09-16 00:21:55 +02:00
Lennart Poettering e7ab4d1ac9 cgroup: unify how we invalidate cgroup controller settings
Let's make sure that we follow the same codepaths when adjusting a
cgroup property via the dbus SetProperty() call, and when we execute the
StartupCPUShares= effect.
2015-09-11 18:31:50 +02:00
Lennart Poettering d53d94743c core: refactor cpu shares/blockio weight cgroup logic
Let's stop using the "unsigned long" type for weights/shares, and let's
just use uint64_t for this, as that's what we expose on the bus.

Unify parsers, and always validate the range for these fields.

Correct the default blockio weight to 500, since that's what the kernel
actually uses.

When parsing the weight/shares settings from unit files accept the empty
string as a way to reset the weight/shares value. When getting it via
the bus, uniformly map (uint64_t) -1 to unset.

Open up StartupCPUShares= and StartupBlockIOWeight= to transient units.
2015-09-11 18:31:49 +02:00
Lennart Poettering 03a7b521e3 core: add support for the "pids" cgroup controller
This adds support for the new "pids" cgroup controller of 4.3 kernels.
It allows accounting the number of tasks in a cgroup and enforcing
limits on it.

This adds two new setting TasksAccounting= and TasksMax= to each unit,
as well as a gloabl option DefaultTasksAccounting=.

This also updated "cgtop" to optionally make use of the new
kernel-provided accounting.

systemctl has been updated to show the number of tasks for each service
if it is available.

This patch also adds correct support for undoing memory limits for units
using a MemoryLimit=infinity syntax. We do the same for TasksMax= now
and hence keep things in sync here.
2015-09-10 18:41:06 +02:00
Lennart Poettering 3905f12713 cgroups: make sure the "devices" controller's enum is named the same way as the controller in the kernel
Follow-up to 5bf8002a3a.
2015-09-08 18:15:50 +02:00
Lennart Poettering 19af675e99 cgroups: delegation to unprivileged services is safe in the unified hierarchy
Delegation to unpriviliged processes is safe in the unified hierarchy,
hence allow it. This has the benefit of permitting "systemd --user"
instances to further partition their resources between user services.
2015-09-04 09:23:07 +02:00
Lennart Poettering b3ac818be8 core: split up manager_get_unit_by_pid()
Let's move the actual cgroup part of it into a new separate function
manager_get_unit_by_pid_cgroup(), and then make
manager_get_unit_by_pid() just a wrapper that also checks the two pid
hashmaps.

Then, let's make sure the various calls that want to deliver events to
the owners of a PID check both hashmaps and the cgroup and deliver the
event to *each* of them. OTOH make sure bus calls like GetUnitByPID()
continue to check the PID hashmaps first and the cgroup only as
fallback.
2015-09-04 09:07:31 +02:00
Lennart Poettering fea72cc033 macro: introduce new PID_TO_PTR macros and make use of them
This adds a new PID_TO_PTR() macro, plus PTR_TO_PID() and makes use of
it wherever we maintain processes in a hash table. Previously we
sometimes used LONG_TO_PTR() and other times ULONG_TO_PTR() for that,
hence let's make this more explicit and clean up things.
2015-09-04 09:07:30 +02:00
Thomas Hindoe Paaboel Andersen b3c5bad3d6 tree-wide: fix indentation 2015-09-02 20:46:42 +02:00
Lennart Poettering efdb02375b core: unified cgroup hierarchy support
This patch set adds full support the new unified cgroup hierarchy logic
of modern kernels.

A new kernel command line option "systemd.unified_cgroup_hierarchy=1" is
added. If specified the unified hierarchy is mounted to /sys/fs/cgroup
instead of a tmpfs. No further hierarchies are mounted. The kernel
command line option defaults to off. We can turn it on by default as
soon as the kernel's APIs regarding this are stabilized (but even then
downstream distros might want to turn this off, as this will break any
tools that access cgroupfs directly).

It is possibly to choose for each boot individually whether the unified
or the legacy hierarchy is used. nspawn will by default provide the
legacy hierarchy to containers if the host is using it, and the unified
otherwise. However it is possible to run containers with the unified
hierarchy on a legacy host and vice versa, by setting the
$UNIFIED_CGROUP_HIERARCHY environment variable for nspawn to 1 or 0,
respectively.

The unified hierarchy provides reliable cgroup empty notifications for
the first time, via inotify. To make use of this we maintain one
manager-wide inotify fd, and each cgroup to it.

This patch also removes cg_delete() which is unused now.

On kernel 4.2 only the "memory" controller is compatible with the
unified hierarchy, hence that's the only controller systemd exposes when
booted in unified heirarchy mode.

This introduces a new enum for enumerating supported controllers, plus a
related enum for the mask bits mapping to it. The core is changed to
make use of this everywhere.

This moves PID 1 into a new "init.scope" implicit scope unit in the root
slice. This is necessary since on the unified hierarchy cgroups may
either contain subgroups or processes but not both. PID 1 hence has to
move out of the root cgroup (strictly speaking the root cgroup is the
only one where processes and subgroups are still allowed, but in order
to support containers nicey, we move PID 1 into the new scope in all
cases.) This new unit is also used on legacy hierarchy setups. It's
actually pretty useful on all systems, as it can then be used to filter
journal messages coming from PID 1, and so on.

The root slice ("-.slice") is now implicitly created and started (and
does not require a unit file on disk anymore), since
that's where "init.scope" is located and the slice needs to be started
before the scope can.

To check whether we are in unified or legacy hierarchy mode we use
statfs() on /sys/fs/cgroup. If the .f_type field reports tmpfs we are in
legacy mode, if it reports cgroupfs we are in unified mode.

This patch set carefuly makes sure that cgls and cgtop continue to work
as desired.

When invoking nspawn as a service it will implicitly create two
subcgroups in the cgroup it is using, one to move the nspawn process
into, the other to move the actual container processes into. This is
done because of the requirement that cgroups may either contain
processes or other subgroups.
2015-09-01 23:52:27 +02:00
Lennart Poettering 5fe8876b32 core: when looking for the unit for a process, look at the PID hashmaps first
It's cheaper that going to cgroupfs, and also usually the better choice
since it's not racy and can map PIDs even if they were moved to a
different unit.
2015-09-01 18:47:46 +02:00
Lennart Poettering 6f883237f1 cgroup: drop "ignore_self" argument from cg_is_empty()
In all cases where the function (or cg_is_empty_recursive()) ignoring
the calling process is actually wrong, as a process keeps a cgroup busy
regardless if its the current one or another. Hence, let's simplify
things and drop the "ignore_self" parameter.
2015-09-01 18:37:01 +02:00
Lennart Poettering e9db43d591 units: enable waiting for unit termination in certain cases
The legacy cgroup hierarchy does not support reliable empty
notifications in containers and if there are left-over subgroups in a
cgroup. This makes it hard to correctly wait for them running empty, and
thus we previously disabled this logic entirely.

With this change we explicitly check for the container case, and whether
the unit is a "delegation" unit (i.e. one where programs may create
their own subgroups). If we are neither in a container, nor operating on
a delegation unit cgroup empty notifications become reliable and thus we
start waiting for the empty notifications again.

This doesn't really fix the general problem around cgroup notifications
but reduces the effect around it.

(This also reorders #include lines by their focus, as suggsted in
CODING_STYLE. We have to add "virt.h", so let's do that at the right
place.)

Also see #317.
2015-09-01 17:44:17 +02:00
Lennart Poettering 35b7ff80e2 unit: add new macros to test for unit contexts 2015-08-31 13:20:43 +02:00
Lennart Poettering b2c23da8ce core: rename SystemdRunningAs to ManagerRunningAs
It's primarily just a property of the Manager object after all, and we
try to refer to PID 1 as "manager" instead of "systemd", hence let's to
stick to this here too.
2015-05-11 22:51:49 +02:00
Ronny Chevalier 0b452006de shared: add process-util.[ch] 2015-04-10 23:54:49 +02:00
Lennart Poettering 5ad096b3f1 core: expose consumed CPU time per unit
This adds support for showing the accumulated consumed CPU time per-unit
in the "systemctl status" output. The property is also readable via the
bus.
2015-03-02 12:15:25 +01:00
Zbigniew Jędrzejewski-Szmek a3bd89ea99 core/cgroup: fix embarrassing typo
https://github.com/docker/docker/issues/10280
2015-01-31 23:03:56 -05:00
Torstein Husebø cc98b3025e treewide: fix multiple typos 2015-01-26 10:39:47 -05:00
Daniel Mack 71c2687360 cgroup: fix typo 2015-01-19 18:34:17 +01:00
Zbigniew Jędrzejewski-Szmek 7539904965 cgroup: memory limits on / are not supported 2015-01-05 19:04:10 -05:00
Zbigniew Jędrzejewski-Szmek 6da139137e cgroup: fix error message
systemd[1]: Failed to set memory.limit_in_bytes on : Invalid argument
2015-01-05 19:04:10 -05:00
Lennart Poettering 714e2e1d56 cgroup: downgrade log messages when we cannot write to cgroup trees that are mounted read-only 2015-01-05 01:40:51 +01:00
Lennart Poettering 7b3fd6313c scope: make attachment of initial PIDs a bit more robust 2014-12-10 22:06:44 +01:00
Lennart Poettering 0cd385d318 core: don't migrate PIDs for units that may contain subcgroups, do this only for leaf units
Otherwise a slice or delegation unit might move PIDs around ignoring the
fact that it is attached to a subcgroup.
2014-12-10 20:38:24 +01:00
Lennart Poettering b1491eba40 core: rename unit_destroy_cgroup() to unit_destroy_cgroup_if_empty() since it's not quite as destructive as it sounds nowadays 2014-12-09 02:31:42 +01:00
Ross Lagerwall dab5bf8599 cgroup: Handle error when destroying cgroup
If a cgroup fails to be destroyed (most likely because there are still
processes running as part of a service after the main pid exits), don't
free and remove the cgroup unit from the manager.  This fixes a
regression introduced by the cgroup rework in v205 where systemd would
forget about processes still running after the unit becomes inactive.
(This can happen when the main pid exits and KillMode=process or none).
2014-12-09 02:28:09 +01:00
Michal Schmidt 4a62c710b6 treewide: another round of simplifications
Using the same scripts as in f647962d64 "treewide: yet more log_*_errno
+ return simplifications".
2014-11-28 19:57:32 +01:00
Michal Schmidt 56f64d9576 treewide: use log_*_errno whenever %m is in the format string
If the format string contains %m, clearly errno must have a meaningful
value, so we might as well use log_*_errno to have ERRNO= logged.

Using:
find . -name '*.[ch]' | xargs sed -r -i -e \
's/log_(debug|info|notice|warning|error|emergency)\((".*%m.*")/log_\1_errno(errno, \2/'

Plus some whitespace, linewrap, and indent adjustments.
2014-11-28 19:49:27 +01:00
Michal Schmidt 23bbb0de4e treewide: more log_*_errno + return simplifications 2014-11-28 18:24:30 +01:00
Michal Schmidt da927ba997 treewide: no need to negate errno for log_*_errno()
It corrrectly handles both positive and negative errno values.
2014-11-28 13:29:21 +01:00
Michal Schmidt 0a1beeb642 treewide: auto-convert the simple cases to log_*_errno()
As a followup to 086891e5c1 "log: add an "error" parameter to all
low-level logging calls and intrdouce log_error_errno() as log calls
that take error numbers", use sed to convert the simple cases to use
the new macros:

find . -name '*.[ch]' | xargs sed -r -i -e \
's/log_(debug|info|notice|warning|error|emergency)\("(.*)%s"(.*), strerror\(-([a-zA-Z_]+)\)\);/log_\1_errno(-\4, "\2%m"\3);/'

Multi-line log_*() invocations are not covered.
And we also should add log_unit_*_errno().
2014-11-28 12:04:41 +01:00
Lennart Poettering a931ad47a8 core: introduce new Delegate=yes/no property controlling creation of cgroup subhierarchies
For priviliged units this resource control property ensures that the
processes have all controllers systemd manages enabled.

For unpriviliged services (those with User= set) this ensures that
access rights to the service cgroup is granted to the user in question,
to create further subgroups. Note that this only applies to the
name=systemd hierarchy though, as access to other controllers is not
safe for unpriviliged processes.

Delegate=yes should be set for container scopes where a systemd instance
inside the container shall manage the hierarchies below its own cgroup
and have access to all controllers.

Delegate=yes should also be set for user@.service, so that systemd
--user can run, controlling its own cgroup tree.

This commit changes machined, systemd-nspawn@.service and user@.service
to set this boolean, in order to ensure that container management will
just work, and the user systemd instance can run fine.
2014-11-05 18:49:14 +01:00
Zbigniew Jędrzejewski-Szmek b1d6dcf5a5 Do not format USEC_INFINITY as NULL
systemctl would print 'CPUQuotaPerSecUSec=(null)' for no limit. This
does not look right.

Since USEC_INFINITY is one of the valid values, format_timespan()
could return NULL, and we should wrap every use of it in strna() or
similar. But most callers didn't do that, and it seems more robust to
return a string ("infinity") that makes sense most of the time, even
if in some places the result will not be grammatically correct.
2014-09-29 11:09:39 -04:00
Lennart Poettering d81afec1c9 core: split up "starting" manager state into "initializing" and "starting"
We'll stay in "initializing" until basic.target has reached, at which
point we will enter "starting".

This is preparation so that we can change the startip timeout to only
apply to the first phase of startup, not the full procedure.
2014-08-22 18:10:31 +02:00
Lennart Poettering 1aeab12b19 cgroup: only generate warnings if actually writing to cgroup attributes failed 2014-08-15 18:14:37 +02:00
Lennart Poettering 6b2f67b31c cgroup: downgrade log messages about non-existant cgroup attributes to LOG_DEBUG 2014-08-15 11:57:07 +02:00
Kay Sievers 3a43da2832 time-util: add and use USEC/NSEC_INFINIY 2014-07-29 13:20:20 +02:00
Zbigniew Jędrzejewski-Szmek 0d8c31ff72 test-engine: fix access to unit load path
Also add a bit of debugging output to help diagnose problems,
add missing units, and simplify cppflags.

Move test-engine to normal tests from manual tests, it should now
work without destroying the system.
2014-07-20 19:48:16 -04:00
Lennart Poettering 9a05490933 cgroups: simplify CPUQuota= logic
Only accept cpu quota values in percentages, get rid of period
definition.

It's not clear whether the CFS period controllable per-cgroup even has a
future in the kernel, hence let's simplify all this, hardcode the period
to 100ms and only accept percentage based quota values.
2014-05-22 11:53:12 +09:00
Lennart Poettering 637f421e5c cgroups: always propagate controller membership to siblings, for all controllers
This is the behaviour the kernel cgroup rework exposes for all
controllers, hence let's do this already now for all cases.
2014-05-22 07:50:03 +09:00
Lennart Poettering db785129c9 cgroup: rework startup logic
Introduce a (unsigned long) -1 as "unset" state for cpu shares/block io
weights, and keep the startup unit set around all the time.
2014-05-22 07:13:56 +09:00
WaLyong Cho 95ae05c0e7 core: add startup resource control option
Similar to CPUShares= and BlockIOWeight= respectively. However only
assign the specified weight during startup. Each control group
attribute is re-assigned as weight by CPUShares=weight and
BlockIOWeight=weight after startup.  If not CPUShares= or
BlockIOWeight= be specified, then the attribute is re-assigned to each
default attribute value. (default cpu.shares=1024, blkio.weight=1000)
If only CPUShares=weight or BlockIOWeight=weight be specified, then
that implies StartupCPUShares=weight and StartupBlockIOWeight=weight.
2014-05-22 07:13:56 +09:00
Łukasz Stelmach cd7affaeea core: check the right variable for failed open() 2014-05-08 13:24:34 +02:00
Kay Sievers 99a17ada9c core: require cgroups filesystem to be available
We should no longer pretend that we can run in any sensible way
without the kernel supporting us with cgroups functionality.
2014-05-05 18:52:36 +02:00
Lennart Poettering b2f8b02ec2 core: expose CFS CPU time quota as high-level unit properties 2014-04-25 13:27:25 +02:00
Lennart Poettering 7d711efb9c core: make sure we can combine DevicePolicy=closed with PrivateDevices=yes
if PrivateDevices=yes is used we need to make sure we can still
create /dev/null and so on.
2014-03-19 22:00:43 +01:00
Lennart Poettering 03e334a1c7 util: replace close_nointr_nofail() by a more useful safe_close()
safe_close() automatically becomes a NOP when a negative fd is passed,
and returns -1 unconditionally. This makes it easy to write lines like
this:

        fd = safe_close(fd);

Which will close an fd if it is open, and reset the fd variable
correctly.

By making use of this new scheme we can drop a > 200 lines of code that
was required to test for non-negative fds or to reset the closed fd
variable afterwards.
2014-03-18 19:31:34 +01:00
Lennart Poettering e41969e3d1 core: support globbing matches in DeviceAllow= when checking for device groups 2014-03-11 17:43:41 +01:00
Lennart Poettering 01efdf13a6 cgroup: certain cgroup attributes are not available in the root cgroup, hence don't bother 2014-02-24 03:38:58 +01:00
Lennart Poettering 90060676c4 cgroup: Extend DeviceAllow= syntax to whitelist groups of devices, not just particular devices nodes 2014-02-22 03:05:34 +01:00
Lennart Poettering d4fdc205a4 update TODO 2014-02-19 18:20:12 +01:00
Jan Engelhardt 73e231abde doc: update punctuation
Resolve spotted issues related to missing or extraneous commas, dashes.
2014-02-17 19:03:07 -05:00
Lennart Poettering 03b90d4bad core: find the closest parent slice that has a specfic cgroup controller enabled when enabling/disabling cgroup controllers for units 2014-02-17 15:49:21 +01:00
Lennart Poettering bc432dc7eb core: rework cgroup mask propagation
Previously a cgroup setting down tree would result in cgroup membership
additions being propagated up the tree and to the siblings, however a
unit could never lose cgroup memberships again. With this change we'll
make sure that both cgroup additions and removals propagate properly.
2014-02-17 15:49:21 +01:00
David Strauss 6414b7c981 cgroups: Cache controller masks and optimize queues. 2013-11-22 11:22:47 +10:00
Zbigniew Jędrzejewski-Szmek a94042fa9b systemd: fix memory leak in cgroup code
If the unit already was in the hashmap, path would be leaked.
2013-11-09 19:02:53 -05:00
David Strauss f366954523 Comment spelling fixes. 2013-11-06 20:03:18 +10:00
Lennart Poettering 15c60e99a9 cgroup: run PID 1 in the root cgroup
This way cleaning up the cgroup tree on shutdown is a lot easier since
we are in the root dir. Also PID 1 was previously artificially placed in
system.slice, even though our rule actually was not to have processes in
slices. The root slice otoh is magic anyway, so having PID 1 in there
sounds less surprising.

Of course, this means that PID is scheduled against the three top-level
slices.
2013-11-06 02:12:21 +01:00
Lennart Poettering 71fda00f32 list: make our list macros a bit easier to use by not requring type spec on each invocation
We can determine the list entry type via the typeof() gcc construct, and
so we should to make the macros much shorter to use.
2013-10-14 06:11:19 +02:00
Lennart Poettering 13b84ec7df cgroup: if we do a cgroup operation then do something on all supported controllers
Previously we did operations like attach, trim or migrate only on the
controllers that were enabled for a specific unit. With this changes we
will now do them for all supproted controllers, and fall back to all
possible prefix paths if the specified paths do not exist.

This fixes issues if a controller is being disabled for a unit where it
was previously enabled, and makes sure that all processes stay as "far
down" the tree as groups exist.
2013-09-25 03:38:17 +02:00
Lennart Poettering e58cec11e6 cgroup: always enable memory.use_hierarchy= for all cgroups in the memory hierarchy
The non-hierarchial mode contradicts the whole idea of a cgroup tree so
let's not support this. In the future the kernel will only support the
hierarchial logic anyway.
2013-09-23 16:02:31 -05:00
Lennart Poettering ddca82aca0 cgroup: get rid of MemorySoftLimit=
The cgroup attribute memory.soft_limit_in_bytes is unlikely to stay
around in the kernel for good, so let's not expose it for now. We can
readd something like it later when the kernel guys decided on a final
API for this.
2013-09-17 14:58:00 -05:00
Gao feng 112a7f4696 cgroup: add missing equals for BlockIOWeight 2013-09-16 09:19:00 -04:00
Lukas Nykryn 81c68af03f core/cgroup: first print then free 2013-09-13 14:40:58 +02:00
Gao feng 6a94f2e938 cgroup: fix incorrectly setting memory cgroup
If the memory_limit of unit is -1, we should write "-1"
to the file memory.limit_in_bytes. not the (unit64_t) -1.

otherwise the memory.limit_in_bytes will be set to zero.
2013-09-13 14:32:14 +02:00
Gao feng 84121bc2ee cgroup: correct the log information
it should be memory.soft_limit_in_bytes.
2013-09-13 14:32:14 +02:00
Gao feng 15b4a7548f cgroup: add the missing setting of variable's value
set the value of variable "r" to the return value
of cg_set_attribute.
2013-09-13 14:32:14 +02:00
Harald Hoyer b58b8e11c5 Do not realloc strings, which are already in the hashmap as keys
This prevents corruption of the hashmap, because we would free() the
keys in the hashmap, if the unit is already in there, with the same
cgroup path.
2013-08-28 16:02:57 +02:00
Harald Hoyer 3d040cf244 Revert "cgroup.c: check return value of unit_realize_cgroup_now()"
This reverts commit 1f11a0cdfe.
2013-08-28 16:02:39 +02:00
Harald Hoyer 1f11a0cdfe cgroup.c: check return value of unit_realize_cgroup_now()
do not recurse further, if unit_realize_cgroup_now() failed
2013-08-23 18:46:51 +02:00
Lennart Poettering 8e7076caae cgroup: split out per-device BlockIOWeight= setting into BlockIODeviceWeight=
This way we can nicely map the configuration directive to properties and
back, without requiring two different signatures for the same property.
2013-07-11 20:40:18 +02:00
Lennart Poettering 8a84192905 cgroup: don't ever try to destroy the cgroup of the root slice
The root slice is after all the root cgroup, so don't attempt to delete
it.
2013-07-11 18:49:52 +02:00
Lennart Poettering be2c1bd2a8 cgroup: don't move systemd into systems.slice when running as --user instance 2013-07-11 18:49:52 +02:00
Lennart Poettering 376dd21dc0 cgroup: downgrade error message when we cannot remove a cgroup to debug
Some units set KillMode=none to survive the initrd→rootfs transition. We
cannot remove their cgroups, but that shouldn't really be considered an
issue, so let's downgrade the error message.
2013-07-10 23:41:03 +02:00
Lennart Poettering 06025d9148 core: don't consider a unit's cgroup empty if only a subcgroup runs empty 2013-07-02 16:24:13 +02:00
Lennart Poettering b56c28c31a cgroup: implicitly add units to GC queue when their cgroups run empty 2013-07-01 00:17:59 +02:00
Lennart Poettering 0a1eb06d9a cgroup: readd proper cgroup empty tracking 2013-07-01 00:17:59 +02:00
Lennart Poettering 4ad490007b core: general cgroup rework
Replace the very generic cgroup hookup with a much simpler one. With
this change only the high-level cgroup settings remain, the ability to
set arbitrary cgroup attributes is removed, so is support for adding
units to arbitrary cgroup controllers or setting arbitrary paths for
them (especially paths that are different for the various controllers).

This also introduces a new -.slice root slice, that is the parent of
system.slice and friends. This enables easy admin configuration of
root-level cgrouo properties.

This replaces DeviceDeny= by DevicePolicy=, and implicitly adds in
/dev/null, /dev/zero and friends if DeviceAllow= is used (unless this is
turned off by DevicePolicy=).
2013-06-27 04:17:34 +02:00
Lennart Poettering 9444b1f20e logind: add infrastructure to keep track of machines, and move to slices
- This changes all logind cgroup objects to use slice objects rather
  than fixed croup locations.

- logind can now collect minimal information about running
  VMs/containers. As fixed cgroup locations can no longer be used we
  need an entity that keeps track of machine cgroups in whatever slice
  they might be located. Since logind already keeps track of users,
  sessions and seats this is a trivial addition.

- nspawn will now register with logind and pass various bits of metadata
  along. A new option "--slice=" has been added to place the container
  in a specific slice.

- loginctl gained commands to list, introspect and terminate machines.

- user.slice and machine.slice will now be pulled in by logind.service,
  since only logind.service requires this slice.
2013-06-20 03:49:59 +02:00
Lennart Poettering a016b9228f core: add new .slice unit type for partitioning systems
In order to prepare for the kernel cgroup rework, let's introduce a new
unit type to systemd, the "slice". Slices can be arranged in a tree and
are useful to partition resources freely and hierarchally by the user.

Each service unit can now be assigned to one of these slices, and later
on login users and machines may too.

Slices translate pretty directly to the cgroup hierarchy, and the
various objects can be assigned to any of the slices in the tree.
2013-06-17 21:36:51 +02:00
Lennart Poettering 7027ff61a3 nspawn: introduce the new /machine/ tree in the cgroup tree and move containers there
Containers will now carry a label (normally derived from the root
directory name, but configurable by the user), and the container's root
cgroup is /machine/<label>. This label is called "machine name", and can
cover both containers and VMs (as soon as libvirt also makes use of
/machine/).

libsystemd-login can be used to query the machine name from a process.

This patch also includes numerous clean-ups for the cgroup code.
2013-04-16 04:41:21 +02:00
Lennart Poettering a32360f1a5 core: always create /user and /machine top-level cgroup dirs
This allows clients to put inotify watches on these trees to watch for
state changes, without having to wait until these dirs are created.

This introduces the new top-level /machine cgroup dir as canonical
location where OS containers and VMs shall be located (as discussed with
the libvirt folks).
2013-04-15 21:59:04 +02:00
Lennart Poettering 974efc4658 cgroup: always keep access mode of 'tasks' and 'cgroup.procs' files in cgroup directories in sync 2013-04-08 18:22:47 +02:00
Lennart Poettering 8e70580bb0 cgroup: minor optimization 2013-03-22 15:46:49 +01:00
Lennart Poettering 246aa6dd9d core: add bus API and systemctl commands for altering cgroup parameters during runtime 2013-01-14 21:24:57 +01:00
Zbigniew Jędrzejewski-Szmek 67445f4e22 core: move ManagerRunningAs to shared
Note: I did s/MANAGER/SYSTEMD/ everywhere, even though it makes the
patch quite verbose. Nevertheless, keeping MANAGER prefix in some
places, and SYSTEMD prefix in others would just lead to confusion down
the road. Better to rip off the band-aid now.
2012-09-18 19:53:34 +02:00
Shawn Landden 0d0f0c50d3 log.h: new log_oom() -> int -ENOMEM, use it
also a number of minor fixups and bug fixes: spelling, oom errors
that didn't print errors, not properly forwarding error codes,
few more consistency issues, et cetera
2012-07-26 11:48:26 +02:00
Shawn Landden 669241a076 use "Out of memory." consistantly (or with "\n")
glibc/glib both use "out of memory" consistantly so maybe we should
consider that instead of this.

Eliminates one string out of a number of binaries. Also fixes extra newline
in udev/scsi_id
2012-07-25 11:23:57 +02:00
Lennart Poettering b7def68494 util: rename join() to strjoin()
This is to match strappend() and the other string related functions.
2012-07-13 13:41:01 +02:00
Kay Sievers 9eb977db5b util: split-out path-util.[ch] 2012-05-08 02:33:10 +02:00
Lennart Poettering 88f3e0c91f service: explicitly remove control/ subcgroup after each control command
The kernel will only notify us of cgroups running empty if no subcgroups
exist anymore. Hence make sure we don't leave our own control/ subcgroup
around longer than necessary.

https://bugzilla.redhat.com/show_bug.cgi?id=818381
2012-05-03 21:54:44 +02:00
Lennart Poettering b59e246565 logind: remove redundant entries from logind's default controller lists too 2012-04-16 19:15:00 +02:00
Lennart Poettering 9156e799a2 manager: remove unavailable/redundant entries from default controllers list 2012-04-16 18:59:07 +02:00
Lennart Poettering 3474ae3c7e cgroup: if a controller is not available don't try to create cgroups in its hierarchy 2012-04-16 18:59:07 +02:00
Lennart Poettering ecedd90fcd service: place control command in subcgroup control/
Previously, we were brutally and onconditionally killing all processes
in a service's cgroup before starting the service anew, in order to
ensure that StartPre lines cannot be misused to spawn long-running
processes.

On logind-less systems this has the effect that restarting sshd
necessarily calls all active ssh sessions, which is usually not
desirable.

With this patch control processes for a service are placed in a
sub-cgroup called "control/". When starting a service anew we simply
kill this cgroup, but not the main cgroup, in order to avoid killing any
long-running non-control processes from previous runs.

https://bugzilla.redhat.com/show_bug.cgi?id=805942
2012-04-13 23:29:59 +02:00
Lennart Poettering 5430f7f2bc relicense to LGPLv2.1 (with exceptions)
We finally got the OK from all contributors with non-trivial commits to
relicense systemd from GPL2+ to LGPL2.1+.

Some udev bits continue to be GPL2+ for now, but we are looking into
relicensing them too, to allow free copy/paste of all code within
systemd.

The bits that used to be MIT continue to be MIT.

The big benefit of the relicensing is that closed source code may now
link against libsystemd-login.so and friends.
2012-04-12 00:24:39 +02:00
Kay Sievers b30e2f4c18 move libsystemd_core.la sources into core/ 2012-04-11 16:03:51 +02:00
Renamed from src/cgroup.c (Browse further)