We want it for global variables, which LLVM supports and GCC currently
does not (GCC does support it for functions, but we care about global
variables here).
Why is this relevant? When asan is used global variables are padded with
hotzones before and after. But we can't have that for the registration
variables we place in special ELF sections: we want them tightly packed
so that we can iterate through them.
Note that for gcc this isn't an issue, as it will pack stuff in
non-standard sections anyway, even if asan is used.
This cleans up a bit how we set up things for the ELF section magic:
1. Let's always use our gcc macros, instead of __attribute__ directly
2. Align our structures to sizeof(void*), i.e. the pointer size, rather
than a fixed 8 or __BIGGEST_ALIGNMENT__. The former is unnecessarily
high for 32bit systems, the latter too high for 64bit systems. gcc
seems to use ptr alignment for static variables itself, hence this
should be good enough for us too. Moreover, the Linux kernel also
uses pointer alginment for all its ELF section registration magic,
hence this should be good enough for us too.
3. Let's always prefix the sections we create ourself with SYSTEMD_,
just to make clear where they come from.
4. Always align the pointer we start from when iterating through these
lists. This should be unnecessary, but makes things nicely
systematic, as we'll align all pointers we use to access these
sections properly.
We follow no general rule, but in most cases we do not place a space
outside of macro.h. Hence let's stick to that, and adapt macro.h too,
and follow the rule systematically that there shall not be a space
between __attribute__ and ((...
Yes, this does not matter at all, and is purely OCD cosmetics. But then
again, the uses of __attribute__ are very local only, hence the changes
cleaning this up are small and are unlikely to have to be repeated too
often...
This splits out a bunch of functions from fileio.c that have to do with
temporary files. Simply to make the header files a bit shorter, and to
group things more nicely.
No code changes, just some rearranging of source files.
Whenever we invoke external, foreign code from code that has
RLIMIT_NOFILE's soft limit bumped to high values, revert it to 1024
first. This is a safety precaution for compatibility with programs using
select() which cannot operate with fds > 1024.
This commit adds the call to rlimit_nofile_safe() to all invocations of
exec{v,ve,l}() and friends that either are in code that we know runs
with RLIMIT_NOFILE bumped up (which is PID 1 and all journal code for
starters) or that is part of shared code that might end up there.
The calls are placed as early as we can in processes invoking a flavour
of execve(), but after the last time we do fd manipulations, so that we
can still take benefit of the high fd limits for that.
This is really basic stuff and in a follow-up commit will use it all
across the codebase, including in process-util.[ch] which is in
src/basic/. Hence let's move it back to src/basic/ itself.
We are pretty careful to reject abstract sockets that are too long to fit in
the address structure as a NUL-terminated string. And since we parse sockets as
strings, it is not possible to embed a NUL in the the address either. But we
might receive an external socket (abstract or not), and we want to be able to
print its address in all cases. We would call socket_address_verify() and
refuse to print various sockets that the kernel considers legit.
Let's do the strict verification only in case of socket addresses we parse and
open ourselves, and do less strict verification when printing addresses of
existing sockets, and use c-escaping to print embedded NULs and such.
More tests are added.
This should make LGTM happier because on FIXME comment is removed.
It's pretty useful to allow parse_boolean() to take a NULL argument and
return an error in that case, rather than abort. i.e. making this a
runtime rather than programming error allows us to shorten code
elsewhere.
This flag can be used to make chase_symlinks() emit a warning when it
encounters an error.
Such flag can be useful for generating a comprehensive and detailed warning
since chase_symlinks() can generate a warning with a full context.
For now only warnings for unsafe transitions are produced.
device_path_make_{major_minor|canonical) generate device node paths
given a mode_t and a dev_t. We have similar code all over the place,
let's unify this in one place. The former will generate a "/dev/char/"
or "/dev/block" path, and never go to disk. The latter then goes to disk
and resolves that path to the actual path of the device node.
device_path_parse_major_minor() reverses device_path_make_major_minor(),
also withozut going to disk.
We have similar code doing something like this at various places, let's
unify this in a single set of functions. This also allows us to teach
them special tricks, for example handling of the
/run/systemd/inaccessible/{blk|chr} device nodes, which we use for
masking device nodes, and which do not exist in /dev/char/* and
/dev/block/*
We should probably drop path_join() entirely in the long run (and
then rename path_join_many() to it?), but for now let's make one a
wrapper for the other.
Let's be a bit more careful when parsing major/minor pairs, and filter
out more corner cases. This also means using safe_atou() rather than
sscanf() to avoid weird negative unsigned handling and such.
As it turns out glibc and the Linux kernel have different ideas about
the size of dev_t and how many bits exist for the major and the minor.
When validating major/minor numbers we should check against the kernel's
actual sizes, hence add macros for this.
Let's simplify things and drop the logic that /var/lib/machines is setup
as auto-growing btrfs loopback file /var/lib/machines.raw.
THis was done in order to make quota available for machine management,
but quite frankly never really worked properly, as we couldn't grow the
file system in sync with its use properly. Moreover philosophically it's
problematic overriding the admin's choice of file system like this.
Let's hence drop this, and simplify things. Deleting code is a good
feeling.
Now that regular file systems provide project quota we could probably
add per-machine quota support based on that, hence the btrfs quota
argument is not that interesting anymore (though btrfs quota is a bit
more powerful as it allows recursive quota, i.e. that the machine pool
gets an overall quota in addition to per-machine quota).
Fixes: #2728
This is also supposed to be preparation for doing #10234 eventually,
where a very similar operation is requested: instead of importing a tree
to /var/lib/machines it would need to be imported into
/var/lib/portables/.
This adds two optional functions that may be passed to the various copy
functions. One is invoked whenever we start copying a new file object,
the other while we copy file payload in each loop iteration.
When the caller passes one or both they can get notifications about copy
progress, for example to log where things are.
This is to startswith() what PATH_STARTSWITH_SET() is to
path_startswith().
Or in other words, checks if the specified string has any of the listed
prefixes, and if so, returns the remainder of the string.
This changes cg_enable_everywhere() to return which controllers are
enabled for the specified cgroup. This information is then used to
correctly track the enablement mask currently in effect for a unit.
Moreover, when we try to turn off a controller, and this works, then
this is indicates that the parent unit might succesfully turn it off
now, too as our unit might have kept it busy.
So far, when realizing cgroups, i.e. when syncing up the kernel
representation of relevant cgroups with our own idea we would strictly
work from the root to the leaves. This is generally a good approach, as
when controllers are enabled this has to happen in root-to-leaves order.
However, when controllers are disabled this has to happen in the
opposite order: in leaves-to-root order (this is because controllers can
only be enabled in a child if it is already enabled in the parent, and
if it shall be disabled in the parent then it has to be disabled in the
child first, otherwise it is considered busy when it is attempted to
remove it in the parent).
To make things complicated when invalidating a unit's cgroup membershup
systemd can actually turn off some controllers previously turned on at
the very same time as it turns on other controllers previously turned
off. In such a case we have to work up leaves-to-root *and*
root-to-leaves right after each other. With this patch this is
implemented: we still generally operate root-to-leaves, but as soon as
we noticed we successfully turned off a controller previously turned on
for a cgroup we'll re-enqueue the cgroup realization for all parents of
a unit, thus implementing leaves-to-root where necessary.
Ideally, coccinelle would strip unnecessary braces too. But I do not see any
option in coccinelle for this, so instead, I edited the patch text using
search&replace to remove the braces. Unfortunately this is not fully automatic,
in particular it didn't deal well with if-else-if-else blocks and ifdefs, so
there is an increased likelikehood be some bugs in such spots.
I also removed part of the patch that coccinelle generated for udev, where we
returns -1 for failure. This should be fixed independently.
Commit 100d5f6ee6 (user-util: add new wrappers for [...] database
files), ammended by commit 4f07ffa8f5 (Use #if instead of #ifdef for
ENABLE_GSHADOW) moved code from sysuser to basic/user-util.
In doing so, the combination of both commits properly propagated the
ENABLE_GSHADOW conditions around the function manipulating gshadow, but
they forgot to protect the inclusion of the gshadow.h header.
Fix that to be able to build on C libraries that do not provide gshadow
(e.g. uClibc-ng, where it does not exist.)
This changes DEFINE_MAIN_FUNCTION_WITH_POSITIVE_FAILURE() to propagate positive
return values as they were, i.e. stops mapping them all to EXIT_FAILURE. This
was suggested in review, but I thought that we only ever return EXIT_FAILURE,
so we don't need to propagate multiple return values.
I was wrong. Turns out that we already *do* have multiple positive return
values, when we call external binaries and propagate the result. systemd-inhibit
is one example, and b453c447e0 actually broke
this propagation. This commit fixes it.
In systemd-fsck we have the opposite case: we have only one failure value, and the
code needs to be adjusted, so that it keeps returning EXIT_FAILURE.
All other users of DEFINE_MAIN_FUNCTION_WITH_POSITIVE_FAILURE() return <= 1, and
are unaffected by this change.
We generally want to close the pager last. This patch closes the pager last,
after the static destuctor calls. This means that they can do logging and such
like during normal program runtime.
This doesn't have much effect on the final build, because we link libbasic.a
into libsystemd-shared.so, so in the end, all the object built from basic/
end up in libsystemd-shared. And when the static library is linked into binaries,
any objects that are included in it but are not used are trimmed. Hence, the
size of output artifacts doesn't change:
$ du -sb /var/tmp/inst*
54181861 /var/tmp/inst1 (old)
54207441 /var/tmp/inst1s (old split-usr)
54182477 /var/tmp/inst2 (new)
54208041 /var/tmp/inst2s (new split-usr)
(The negligible change in size is because libsystemd-shared.so is bigger
by a few hundred bytes. I guess it's because symbols are named differently
or something like that.)
The effect is on the build process, in particular partial builds. This change
effectively moves the requirements on some build steps toward the leaves of the
dependency tree. Two effects:
- when building items that do not depend on libsystemd-shared, we
build less stuff for libbasic.a (which wouldn't be used anyway,
so it's a net win).
- when building items that do depend on libshared, we reduce libbasic.a as a
synchronization point, possibly allowing better parallelism.
Method:
1. copy list of .h files from src/basic/meson.build to /tmp/basic
2. $ for i in $(grep '.h$' /tmp/basic); do echo $i; git --no-pager grep "include \"$i\"" src/basic/ 'src/lib*' 'src/nss-*' 'src/journal/sd-journal.c' |grep -v "${i%.h}.c";echo ;done | less
Code was not doing a wait() after kill() due to checking for a return value > 0, and was leaving zombie processes. This affected things like sd-bus unixexec connections.
As suggest here:
https://gcc.gnu.org/onlinedocs/gcc/Attribute-Syntax.html#Attribute-Syntax
"You may optionally specify attribute names with ‘__’ preceding and
following the name. This allows you to use them in header files without
being concerned about a possible macro of the same name. For example,
you may use the attribute name __noreturn__ instead of noreturn. "
This way, we can extend the macro a bit with stuff pulled in from other
headers without this affecting everything which pulls in macro.h, which
is one of our most basic headers.
This is just refactoring, no change in behaviour, in prepartion for
later changes.
It was only used in one place, where we don't actually need it, and
it is too easy to forget to update it when adding new items to the table.
Let's just drop it.
systemd only uses functions that are as of Linux 4.15+ provided
externally to the CPU controller (currently usage_usec), so if we have a
new enough kernel, we don't need to set CGROUP_MASK_CPU for
CPUAccounting=true as the CPU controller does not need to necessarily be
enabled in this case.
Part of this patch is modelled on an earlier patch by Ryutaroh Matsumoto
(see PR #9665).
I decided to use a separate definition for this because it's too easy to return
positive from functions which don't need this distinction and only return
negative on error and success otherwise.
If we create a cgroup in one controller it might already have been
created in another too, if we have jointly mounted controllers. Take
that into consideration.
The function takes a pointer to a random block of memory and
the length of that block. It shouldn't crash every time it sees
a zero byte at the beginning there.
This should help the dev-kmsg fuzzer to keep going.
With gcc-7.1.1-3.fc26.aarch64:
../src/basic/json.c: In function ‘json_format’:
../src/basic/json.c:1409:40: warning: comparison is always true due to limited range of data type [-Wtype-limits]
if (*q >= 0 && *q < ' ')
^~
../src/basic/json.c: In function ‘inc_lines_columns’:
../src/basic/json.c:1762:31: warning: comparison is always true due to limited range of data type [-Wtype-limits]
} else if (*s >= 0 && *s < 127) /* Process ASCII chars quickly */
^~
Cast to (signed char) silences the warning, but a cast to (int) for some reason
doesn't.
Now that we don't (mis-)use the env file parser to parse kernel command
lines there's no need anymore to override the used newline character
set. Let's hence drop the argument and just "\n\r" always. This nicely
simplifies our code.
This introduces a wrapper around extrac_first_word() called
proc_cmdline_extract_first(), which suppresses "rd." parameters
depending on the specified calls.
This allows us to share more code between proc_cmdline_parse_given() and
proc_cmdline_get_key(), and makes it easier to reuse this logic for
other purposes.
Normally, we want to immediately quit on ^C. But when we are running under
less, people may set SYSTEMD_LESS without K, in which case they can use ^C to
communicate with less, and e.g. start and stop following input.
Fixes#6405.
All users of the macro (except for one, in serialize.c), use the macro in
connection with read_line(), so they must include fileio.h. Let's not play
libc games and require multiple header file to be included for the most common
use of a function.
The removal of def.h includes is not exact. I mostly went over the commits that
switch over to use read_line() and add def.h at the same time and reverted the
addition of def.h in those files.
Pretty much everything uses just the first argument, and this doesn't make this
common pattern more complicated, but makes it simpler to pass multiple options.
This makes DEPTH_MAX lower value, as test-json fails with stack
overflow.
Note that the test can pass with 8k, but for safety, here set to 4k.
Fixes#10738.
This helper is useful to ensure pidns/userns joining is properly
executed (as that requires a fork after the setns()). This is
particularly important when it comes to /proc/self/ access or
SCM_CREDENTIALS, but is generally the safer mode of operation.
Rename rdrand64 to rdrand, and switch from uint64_t to unsigned long.
This produces code that will compile/assemble on both x86-64 and x86-32.
This could be useful when running a 32-bit copy of systemd on a modern
Intel processor.
RDRAND is inherently arch-specific, so relying on the compiler-defined
'long' type seems reasonable.
We only use this when we don't require the best randomness. The primary
usecase for this is UUID generation, as this means we don't drain
randomness from the kernel pool for them. Since UUIDs are usually not
secrets RDRAND should be goot enough for them to avoid real-life
collisions.
Originally, the high_quality_required boolean argument controlled two
things: whether to extend any random data we successfully read with
pseudo-random data, and whether to return -ENODATA if we couldn't read
any data at all.
The boolean got replaced by RANDOM_EXTEND_WITH_PSEUDO, but this name
doesn't really cover the second part nicely. Moreover hiding both
changes of behaviour under a single flag is confusing. Hence, let's
split this part off under a new flag, and use it from random_bytes().
This should normally not happen, but given that the man page suggests
something about this in the context of interruption, let's handle this
and propagate an I/O error.
It's more descriptive, since we also have a function random_bytes()
which sounds very similar.
Also rename pseudorandom_bytes() to pseudo_random_bytes(). This way the
two functions are nicely systematic, one returning genuine random bytes
and the other pseudo random ones.
It's cheap to get RDRAND and given that srand() is anyway not really
useful for trusted randomness let's use RDRAND for it, after all we have
all the hard work for that already in place.
Otherwise doing comparing a CGroupMask (which is unsigned in effect)
with the result of CGROUP_CONTROLLER_TO_MASK() will result in warnings
about signedness differences.
We now have the "BPF" pseudo-controllers. These should never be assumed
to be accessible as /sys/fs/cgroup/<controller> and not through
"cgroup.subtree_control" either, hence always check explicitly before we
go to the file system. We do this through our new CGROUP_MASK_V1 and
CGROUP_MASK_V2 definitions.
Also, when we fail, don't clobber the return value.
This brings the call more in-line with our usual coding style, and
removes surprises.
None of the callers seemed to care about this behaviour.
This was mostly prompted by seeing the expression "in_initrd() && flags
& PROC_CMDLINE_RD_STRICT", which uses & and && without any brackets.
Let's make that a bit more readable and hide all doubts about operator
precedence.
Let's be more careful with what we serialize: let's ensure we never
serialize strings that are longer than LONG_LINE_MAX, so that we know we
can read them back with read_line(…, LONG_LINE_MAX, …) safely.
In order to implement this all serialization functions are move to
serialize.[ch], and internally will do line size checks. We'd rather
skip a serialization line (with a loud warning) than write an overly
long line out. Of course, this is just a second level protection, after
all the data we serialize shouldn't be this long in the first place.
While we are at it also clean up logging: while serializing make sure to
always log about errors immediately. Also, (void)ify all calls we don't
expect errors in (or catch errors as part of the general
fflush_and_check() at the end.
journald calls fd_get_path() a lot (it probably shouldn't, there's some
room for improvement there, but I'll leave that for another time), hence
it's worth optimizing the call a bit, in particular as it's easy.
Previously we'd open the dir /proc/self/fd/ first, before reading the
symlink inside it. This means the whole function requires three system
calls: open(), readlinkat(), close(). The reason for doing it this way
is to distinguish the case when we see ENOENT because /proc is not
mounted and the case when the fd doesn't exist.
With this change we'll directly go for the readlink(), and only if that
fails do an access() to see if /proc is mounted at all.
This optimizes the common case (where the fd is valid and /proc
mounted), in favour of the uncommon case (where the fd doesn#t exist or
/proc is not mounted).
I noticed while profiling journald that we invoke readlinkat() a ton on
open /proc/self/fd/<fd>, and that the returned paths are more often than
not longer than the 99 chars used before, when we look at archived
journal files. This means for these cases we generally need to execute
two rather than one syscalls.
Let's increase the buffer size a tiny bit, so that we reduce the number
of syscalls executed. This is really a low-hanging fruit of
optimization.
Our current set of flags allows an option to be either
use just in initrd or both in initrd and normal system.
This new flag is intended to be used in the case where
you want apply some settings just in initrd or just
in normal system.
This is useful for a couple of cases, I'm mostly interested in case #1:
1. Verifying "reasonable" values in a trivially scriptable way
2. Debugging unexpected time span parsing directly
Test Plan:
```
% build/systemd-analyze timespan 20
Original: 20
μs: 20
Human: 20us
% build/systemd-analyze timespan 20ms
Original: 20ms
μs: 20000
Human: 20ms
% build/systemd-analyze timespan 20z
Failed to parse time span '20z': Invalid argument
```
This is the counterpiece to the boot counting implemented in
systemd-boot: if a boot is detected as successful we mark drop the
counter again from the booted snippet or kernel image.
For the definition of assert_cc() we try to use static_assert and check
for it with "#ifdef". But that can only work if assert.h is imported
before. Hence let's do so.
This is a nice little optimization when using static const strings: we
can now use them directly as JsonVariant objecs, without any additional
allocation.
Simply as a safety precaution so that json objects we read are not
arbitrary amounts deep, so that code that processes json objects
recursively can't be easily exploited (by hitting stack limits).
Follow-up for oss-fuzz#10908
(Nice is that we can accomodate for this counter without increasing the
size of the JsonVariant object.)
Let's move things around a bit, so that the trailing unused whitespace
within the structure due to padding is placed together, so that it is
easier to use for new fields. (Found with pahole)
Let's clean up the failure codepaths, by using _cleanup_.
This relies on the new behaviour of env_append() introduced in the
previous commit that guarantess the list always remains properly NULL
terminated
Following discussions with some kernel folks at All Systems Go! it
appears that file descriptors are not really as expensive as they used
to be (both memory and performance-wise) and it should thus be OK to allow
programs (including unprivileged ones) to have more of them without ill
effects.
Unfortunately we can't just raise the RLIMIT_NOFILE soft limit
globally for all processes, as select() and friends can't handle fds
>= 1024, and thus unexpecting programs might fail if they accidently get
an fd outside of that range. We can however raise the hard limit, so
that programs that need a lot of fds can opt-in into getting fds beyond
the 1024 boundary, simply by bumping the soft limit to the now higher
hard limit.
This is useful for all our client code that accesses the journal, as the
journal merging logic might need a lot of fds. Let's add a unified
function for bumping the limit in a robust way.
This simply adds a new constant we can use for bumping RLIMIT_NOFILE to
a "high" value. It default to 256K for now, which is pretty high, but
smaller than the kernel built-in limit of 1M.
Previously, some tools that needed a higher RLIMIT_NOFILE bumped it to
16K. This new define goes substantially higher than this, following the
discussion with the kernel folks.
All over the place we define local variables for the various sockopts
that take a bool-like "int" value. Sometimes they are const, sometimes
static, sometimes both, sometimes neither.
Let's clean this up, introduce a common const variable "const_int_one"
(as well as one matching "const_int_zero") and use it everywhere, all
acorss the codebase.
The helper is supposed to properly handle cases where .sun_path does not
contain a NUL byte, and thus copies out the path suffix a NUL as
necessary.
This also reworks the more specific socket_address_unlink() to be a
wrapper around the more generic sockaddr_un_unlink()