It's cheap to get RDRAND and given that srand() is anyway not really
useful for trusted randomness let's use RDRAND for it, after all we have
all the hard work for that already in place.
Otherwise doing comparing a CGroupMask (which is unsigned in effect)
with the result of CGROUP_CONTROLLER_TO_MASK() will result in warnings
about signedness differences.
We now have the "BPF" pseudo-controllers. These should never be assumed
to be accessible as /sys/fs/cgroup/<controller> and not through
"cgroup.subtree_control" either, hence always check explicitly before we
go to the file system. We do this through our new CGROUP_MASK_V1 and
CGROUP_MASK_V2 definitions.
Also, when we fail, don't clobber the return value.
This brings the call more in-line with our usual coding style, and
removes surprises.
None of the callers seemed to care about this behaviour.
This was mostly prompted by seeing the expression "in_initrd() && flags
& PROC_CMDLINE_RD_STRICT", which uses & and && without any brackets.
Let's make that a bit more readable and hide all doubts about operator
precedence.
Let's be more careful with what we serialize: let's ensure we never
serialize strings that are longer than LONG_LINE_MAX, so that we know we
can read them back with read_line(…, LONG_LINE_MAX, …) safely.
In order to implement this all serialization functions are move to
serialize.[ch], and internally will do line size checks. We'd rather
skip a serialization line (with a loud warning) than write an overly
long line out. Of course, this is just a second level protection, after
all the data we serialize shouldn't be this long in the first place.
While we are at it also clean up logging: while serializing make sure to
always log about errors immediately. Also, (void)ify all calls we don't
expect errors in (or catch errors as part of the general
fflush_and_check() at the end.
journald calls fd_get_path() a lot (it probably shouldn't, there's some
room for improvement there, but I'll leave that for another time), hence
it's worth optimizing the call a bit, in particular as it's easy.
Previously we'd open the dir /proc/self/fd/ first, before reading the
symlink inside it. This means the whole function requires three system
calls: open(), readlinkat(), close(). The reason for doing it this way
is to distinguish the case when we see ENOENT because /proc is not
mounted and the case when the fd doesn't exist.
With this change we'll directly go for the readlink(), and only if that
fails do an access() to see if /proc is mounted at all.
This optimizes the common case (where the fd is valid and /proc
mounted), in favour of the uncommon case (where the fd doesn#t exist or
/proc is not mounted).
I noticed while profiling journald that we invoke readlinkat() a ton on
open /proc/self/fd/<fd>, and that the returned paths are more often than
not longer than the 99 chars used before, when we look at archived
journal files. This means for these cases we generally need to execute
two rather than one syscalls.
Let's increase the buffer size a tiny bit, so that we reduce the number
of syscalls executed. This is really a low-hanging fruit of
optimization.
Our current set of flags allows an option to be either
use just in initrd or both in initrd and normal system.
This new flag is intended to be used in the case where
you want apply some settings just in initrd or just
in normal system.
This is useful for a couple of cases, I'm mostly interested in case #1:
1. Verifying "reasonable" values in a trivially scriptable way
2. Debugging unexpected time span parsing directly
Test Plan:
```
% build/systemd-analyze timespan 20
Original: 20
μs: 20
Human: 20us
% build/systemd-analyze timespan 20ms
Original: 20ms
μs: 20000
Human: 20ms
% build/systemd-analyze timespan 20z
Failed to parse time span '20z': Invalid argument
```
This is the counterpiece to the boot counting implemented in
systemd-boot: if a boot is detected as successful we mark drop the
counter again from the booted snippet or kernel image.
For the definition of assert_cc() we try to use static_assert and check
for it with "#ifdef". But that can only work if assert.h is imported
before. Hence let's do so.
This is a nice little optimization when using static const strings: we
can now use them directly as JsonVariant objecs, without any additional
allocation.
Simply as a safety precaution so that json objects we read are not
arbitrary amounts deep, so that code that processes json objects
recursively can't be easily exploited (by hitting stack limits).
Follow-up for oss-fuzz#10908
(Nice is that we can accomodate for this counter without increasing the
size of the JsonVariant object.)
Let's move things around a bit, so that the trailing unused whitespace
within the structure due to padding is placed together, so that it is
easier to use for new fields. (Found with pahole)
Let's clean up the failure codepaths, by using _cleanup_.
This relies on the new behaviour of env_append() introduced in the
previous commit that guarantess the list always remains properly NULL
terminated
Following discussions with some kernel folks at All Systems Go! it
appears that file descriptors are not really as expensive as they used
to be (both memory and performance-wise) and it should thus be OK to allow
programs (including unprivileged ones) to have more of them without ill
effects.
Unfortunately we can't just raise the RLIMIT_NOFILE soft limit
globally for all processes, as select() and friends can't handle fds
>= 1024, and thus unexpecting programs might fail if they accidently get
an fd outside of that range. We can however raise the hard limit, so
that programs that need a lot of fds can opt-in into getting fds beyond
the 1024 boundary, simply by bumping the soft limit to the now higher
hard limit.
This is useful for all our client code that accesses the journal, as the
journal merging logic might need a lot of fds. Let's add a unified
function for bumping the limit in a robust way.
This simply adds a new constant we can use for bumping RLIMIT_NOFILE to
a "high" value. It default to 256K for now, which is pretty high, but
smaller than the kernel built-in limit of 1M.
Previously, some tools that needed a higher RLIMIT_NOFILE bumped it to
16K. This new define goes substantially higher than this, following the
discussion with the kernel folks.
All over the place we define local variables for the various sockopts
that take a bool-like "int" value. Sometimes they are const, sometimes
static, sometimes both, sometimes neither.
Let's clean this up, introduce a common const variable "const_int_one"
(as well as one matching "const_int_zero") and use it everywhere, all
acorss the codebase.
The helper is supposed to properly handle cases where .sun_path does not
contain a NUL byte, and thus copies out the path suffix a NUL as
necessary.
This also reworks the more specific socket_address_unlink() to be a
wrapper around the more generic sockaddr_un_unlink()
This makes use of assert_cc() to guard against missing CASE macros,
instead of a manual implementation that might result in a static
variable to be allocated.
More importantly though this changes the base type for the array used to
determine the number of arguments for the compile time check from "int"
to "long double". This is done in order to avoid warnings from "ubsan"
that possibly large constants are assigned to small types. "long double"
hopefully isn't vulnerable to that.
Fixes: #10332
Mempool use is enabled or disabled based on the mempool_use_allowed symbol that
is linked in.
Should fix assert crashes in external programs caused by #9792.
Replaces #10286.
v2:
- use two different source files instead of a gcc constructor
linux/capability.h's CAP_TO_MASK potentially shifts a signed int "1"
(i.e. 32bit wide) left by 31 which means it becomes negative. That's
just weird, and ubsan complains about it. Let's introduce our own macro
CAP_TO_MASK_CORRECTED which doesn't fall into this trap, and make use of
it.
Fixes: #10347
As preparation for OCI support in nspawn, let's add a JSON parser.
The json.h file contains an explanation why this is new code instead of
just us linking against an existing JSON library.
Cgroup v2 provides the eBPF-based device controller, which isn't currently
supported by systemd. This commit aims to provide such support.
There are no user-visible changes, just the device policy and whitelist
start working if cgroup v2 is used.
Let's make sure the integers we parse out are not larger than USHRT_MAX.
This is a good idea as the kernel's TIOCSWINSZ ioctl for sizing
terminals can't take larger values, and we shouldn't risk an overflow.
Comes with tests.
Also add direct test for $SYSTEMD_PROC_CMDLINE.
In test-proc-cmdline, "true" was masquerading as PROC_CMDLINE_STRIP_RD_PREFIX,
fix that. Also, reorder functions to match call order.
Previously, together with kill_dots true, patch like
".", "./.", ".//.//" would all return an empty string.
That is wrong. There must be one "." left to reference
the current directory.
Also, the comment with examples was wrong.
Apparently FAT on some recent kernels can't do RENAME_NOREPLACE, and of
course cannot do linkat()/unlinkat() either (as the hard link concept
does not exist on FAT). Add a fallback using an explicit beforehand
faccessat() check. This sucks, but what we can do if the safe operations
are not available?
Fixes: #10063