docs: import journal file format docs from fdo wiki
Just an import, with no textual changes (some fixed URLs however)
---
title: Journal File Format
category: Interfaces
layout: default
---

# Journal File Format

_Note that this document describes the binary on-disk format of journals only. For interfacing with web technologies there's the [Journal JSON Format](http://www.freedesktop.org/wiki/Software/systemd/json). For transfer of journal data across the network there's the [Journal Export Format](http://www.freedesktop.org/wiki/Software/systemd/export)._

The systemd journal stores log data in a binary format with several features:

* Fully indexed by all fields
* Can store binary data, up to 2^64-1 bytes in size
* Seekable
* Primarily append-based, hence robust to corruption
* Support for in-line compression
* Support for in-line Forward Secure Sealing

This document explains the basic structure of the file format on disk. We are making this available primarily to allow review and provide documentation. Note that the actual implementation in the [systemd codebase](https://github.com/systemd/systemd/blob/master/src/journal/) is the only ultimately authoritative description of the format, so if this document and the code disagree, the code is right. That said, we'll of course try hard to keep this document up-to-date and accurate.

Instead of implementing your own reader or writer for journal files, we ask you to use the [Journal's native C API](http://www.freedesktop.org/software/systemd/man/sd-journal.html) to access these files. It provides you with full access to the files, and will not withhold any data. If you find a limitation, please ping us and we might add some additional interfaces for you.

If you need access to the raw journal data in serialized stream form without the C API, our recommendation is to make use of the [Journal Export Format](http://www.freedesktop.org/wiki/Software/systemd/export), which you can get via "journalctl -o export" or via systemd-journal-gatewayd. The export format is much simpler to parse, but complete and accurate. Due to its stream-based nature it is not indexed.

_Or, to put this in other words: this low-level document is probably not what you want to use as the basis of your project. You want our [C API](http://www.freedesktop.org/software/systemd/man/sd-journal.html) instead! And if you really don't want the C API, then you want the [Journal Export Format](http://www.freedesktop.org/wiki/Software/systemd/export) instead! This document is primarily for your entertainment and education. Thank you!_

This document assumes you have a basic understanding of the journal concepts, the properties of a journal entry and so on. If not, please go and read up, then come back! This is a good opportunity to read about the [basic properties of journal entries](http://www.freedesktop.org/software/systemd/man/systemd.journal-fields.html); in particular, realize that they may include binary non-text data (though they usually don't), and that the same field might have multiple values assigned within the same entry.

This document describes the current format of systemd 195. The documented format is compatible with the format used in the first versions of the journal, but has received various compatible additions since then.

If you are wondering why the journal file format has been created in the first place instead of adopting an existing database implementation, please have a look [at this thread](https://lists.freedesktop.org/archives/systemd-devel/2012-October/007054.html).

## Basics

* All offsets, sizes, time values, hashes (and most other numeric values) are 64bit unsigned integers in LE format.
* Offsets are always relative to the beginning of the file.
* The 64bit hash function used is [Jenkins lookup3](https://en.wikipedia.org/wiki/Jenkins_hash_function), more specifically jenkins_hashlittle2(), with the first 32bit integer it returns used as the higher 32bit part of the 64bit value, and the second one used as the lower 32bit part.
* All structures are aligned to 64bit boundaries and padded to multiples of 64bit.
* The format is designed to be read and written via memory mapping using multiple mapped windows.
* All time values are stored in usec since the respective epoch.
* Wall clock time values are relative to the Unix time epoch, i.e. January 1st, 1970. (CLOCK_REALTIME)
* Monotonic time values are always stored jointly with the kernel boot ID value (i.e. /proc/sys/kernel/random/boot_id) they belong to. They tend to be relative to the start of the boot, but aren't for containers. (CLOCK_MONOTONIC)
* Randomized, unique 128bit IDs are used in various locations. These are generally UUID v4 compatible, but this is not a requirement.

## General Rules

If any kind of corruption is noticed by a writer it should immediately rotate the file and start a new one. No further writes should be attempted to the original file, but it should be left around so that as little data as possible is lost.

If any kind of corruption is noticed by a reader it should try hard to handle this gracefully, such as skipping over the corrupted data, but allowing access to as much data around it as possible.

A reader should verify all offsets and other data as it reads them. This includes checking for alignment and range of offsets in the file, especially before trying to read data via a memory map.

A reader must interleave rotated and corrupted files as well as possible and present them as a single stream to the user.

All fields marked as "reserved" must be initialized with 0 when writing and be ignored on reading. They are currently not used but might be used later on.

## Structure

The file format's data structures are declared in [journal-def.h](https://github.com/systemd/systemd/blob/master/src/journal/journal-def.h).

The file format begins with a header structure. After the header structure, object structures follow. Objects are appended to the end as time progresses. Most data stored in these objects is not altered anymore after having been written once, with the exception of records necessary for indexing. When new data is appended to a file the writer first writes all new objects to the end of the file, and then links them up at the front after that's done. Currently, seven different object types are known:

```c
enum {
        OBJECT_UNUSED,
        OBJECT_DATA,
        OBJECT_FIELD,
        OBJECT_ENTRY,
        OBJECT_DATA_HASH_TABLE,
        OBJECT_FIELD_HASH_TABLE,
        OBJECT_ENTRY_ARRAY,
        OBJECT_TAG,
        _OBJECT_TYPE_MAX
};
```

* A **DATA** object, which encapsulates the contents of one field of an entry, i.e. a string such as `_SYSTEMD_UNIT=avahi-daemon.service`, or `MESSAGE=Foobar made a booboo.`, but possibly including large or binary data, and always prefixed by the field name and "=".
* A **FIELD** object, which encapsulates a field name, i.e. a string such as `_SYSTEMD_UNIT` or `MESSAGE`, without any `=` or even value.
* An **ENTRY** object, which binds several **DATA** objects together into a log entry.
* A **DATA_HASH_TABLE** object, which encapsulates a hash table for finding existing **DATA** objects.
* A **FIELD_HASH_TABLE** object, which encapsulates a hash table for finding existing **FIELD** objects.
* An **ENTRY_ARRAY** object, which encapsulates a sorted array of offsets to entries, used for seeking by binary search.
* A **TAG** object, consisting of an FSS sealing tag for all data from the beginning of the file or the last tag written (whichever is later).

## Header

The Header struct defines, well, you guessed it, the file header:

```c
_packed_ struct Header {
        uint8_t signature[8]; /* "LPKSHHRH" */
        le32_t compatible_flags;
        le32_t incompatible_flags;
        uint8_t state;
        uint8_t reserved[7];
        sd_id128_t file_id;
        sd_id128_t machine_id;
        sd_id128_t boot_id;    /* last writer */
        sd_id128_t seqnum_id;
        le64_t header_size;
        le64_t arena_size;
        le64_t data_hash_table_offset;
        le64_t data_hash_table_size;
        le64_t field_hash_table_offset;
        le64_t field_hash_table_size;
        le64_t tail_object_offset;
        le64_t n_objects;
        le64_t n_entries;
        le64_t tail_entry_seqnum;
        le64_t head_entry_seqnum;
        le64_t entry_array_offset;
        le64_t head_entry_realtime;
        le64_t tail_entry_realtime;
        le64_t tail_entry_monotonic;
        /* Added in 187 */
        le64_t n_data;
        le64_t n_fields;
        /* Added in 189 */
        le64_t n_tags;
        le64_t n_entry_arrays;
};
```

The first 8 bytes of journal files must contain the ASCII characters LPKSHHRH.

If a writer finds that the **machine_id** of a file to write to does not match the machine it is running on it should immediately rotate the file and start a new one.

When a journal file is first created the **file_id** is randomly and uniquely initialized.

When a writer opens a file it shall initialize the **boot_id** to the current boot ID of the system.

The currently used part of the file is the **header_size** plus the **arena_size** field of the header. If a writer needs to write to a file where the actual file size on disk is smaller than the reported value it shall immediately rotate the file and start a new one. If a writer is asked to write to a file with a header that is shorter than its own definition of the struct Header, it shall immediately rotate the file and start a new one.

The **n_objects** field contains a counter for objects currently available in this file. As objects are appended to the end of the file this counter is increased.

The first object in the file starts immediately after the header. The last object in the file is at the offset **tail_object_offset**, which may be 0 if no object is in the file yet.

The **n_entries**, **n_data**, **n_fields**, **n_tags**, **n_entry_arrays** fields are counters of the objects of the specific types.

**tail_entry_seqnum** and **head_entry_seqnum** contain the sequential number (see below) of the last or first entry in the file, respectively, or 0 if no entry has been written yet.

**tail_entry_realtime** and **head_entry_realtime** contain the wallclock timestamp of the last or first entry in the file, respectively, or 0 if no entry has been written yet.

**tail_entry_monotonic** is the monotonic timestamp of the last entry in the file, referring to monotonic time of the boot identified by **boot_id**.

## Extensibility

The format is supposed to be extensible in order to enable future additions of features. Readers should simply skip objects of unknown types as they read them. If a compatible feature extension is made, a new bit is registered in the header's **compatible_flags** field. If a feature extension is used that makes the format incompatible, a new bit is registered in the header's **incompatible_flags** field. Readers should check these two bit fields: if they find a flag they don't understand in **compatible_flags** they should continue to read the file, but if they find one in **incompatible_flags** they should fail, asking for an update of the software. Writers should refuse writing if there's an unknown bit flag in either of these fields.

The file header may be extended as new features are added. The size of the file header is stored in the header. All header fields up to **n_data** are known to unconditionally exist in all revisions of the file format; all fields starting with **n_data** need to be explicitly checked for via a size check, since they were additions after the initial release.

Currently only two extensions flagged in the flags fields are known:

```c
enum {
        HEADER_INCOMPATIBLE_COMPRESSED = 1
};

enum {
        HEADER_COMPATIBLE_SEALED = 1
};
```

HEADER_INCOMPATIBLE_COMPRESSED indicates that the file includes DATA objects that are compressed using XZ.

HEADER_COMPATIBLE_SEALED indicates that the file includes TAG objects required for Forward Secure Sealing.

## Dirty Detection

```c
enum {
        STATE_OFFLINE = 0,
        STATE_ONLINE = 1,
        STATE_ARCHIVED = 2,
        _STATE_MAX
};
```

If a file is opened for writing the **state** field should be set to STATE_ONLINE. If a file is closed after writing the **state** field should be set to STATE_OFFLINE. After a file has been rotated it should be set to STATE_ARCHIVED. If a writer is asked to write to a file that is not in STATE_OFFLINE it should immediately rotate the file and start a new one, without changing the file.

Before and after the **state** field is changed, fdatasync() should be executed on the file to ensure the dirty state hits disk.

## Sequence Numbers

All entries carry sequence numbers that are monotonically counted up for each entry (starting at 1) and are unique among all files which carry the same **seqnum_id** field. This field is randomly generated when the journal daemon creates its first file. All files generated by the same journal daemon instance should hence carry the same seqnum_id. This should guarantee a monotonic stream of sequential numbers for easy interleaving even if entries are distributed among several files, such as the system journal and many per-user journals.

## Concurrency

The file format is designed to be usable in a simultaneous single-writer/multiple-reader scenario. The synchronization model is very weak in order to facilitate storage on the most basic of file systems (well, the most basic ones that provide us with mmap() that is), and to allow good performance. No file locking is used. The only time where disk synchronization via fdatasync() should be enforced is after and before changing the **state** field in the file header (see above). It is recommended to execute a memory barrier after appending and initializing new objects at the end of the file, and before linking them up in the earlier objects.

This weak synchronization model means that it is crucial that readers verify the structural integrity of the file as they read it, and handle invalid structure gracefully. (Checking what you read is a pretty good idea out of security considerations anyway.) This specifically includes checking offset values, and that they point to valid objects, with valid sizes and of the type and hash value expected. All code must be written with the fact in mind that a file with inconsistent structure might just be inconsistent temporarily, and might become consistent later on. Payload OTOH requires less scrutiny, as it should only be linked up (and hence visible to readers) after it was successfully written to memory (though not necessarily to disk). On non-local file systems it is a good idea to verify the payload hashes when reading, in order to avoid annoyances with mmap() inconsistencies.

Clients intending to show a live view of the journal should use inotify for this, watching for file changes. Since file writes done via mmap() do not result in inotify events, writers shall truncate the file to its current size after writing one or more entries, which results in inotify events being generated. Note that this is not used as a transaction scheme (it doesn't protect anything), but merely for triggering wakeups.

Note that inotify will not work on network file systems if reader and writer reside on different hosts. Readers which detect that they are reading journal files on a non-local file system should hence not rely on inotify for live views but fall back to simple time-based polling of the files (maybe rechecking every 2s).

## Objects

All objects carry a common header:

```c
enum {
        OBJECT_COMPRESSED = 1
};

_packed_ struct ObjectHeader {
        uint8_t type;
        uint8_t flags;
        uint8_t reserved[6];
        le64_t size;
        uint8_t payload[];
};
```

The **type** field is one of the object types listed above. The **flags** field currently knows one flag: OBJECT_COMPRESSED. It is only valid for DATA objects and indicates that the data payload is compressed with XZ. If OBJECT_COMPRESSED is set for an object, HEADER_INCOMPATIBLE_COMPRESSED must be set for the file as well. The **size** field encodes the size of the object including all its headers and payload.

## Data Objects

```c
_packed_ struct DataObject {
        ObjectHeader object;
        le64_t hash;
        le64_t next_hash_offset;
        le64_t next_field_offset;
        le64_t entry_offset; /* the first array entry we store inline */
        le64_t entry_array_offset;
        le64_t n_entries;
        uint8_t payload[];
};
```

Data objects carry actual field data in the **payload[]** array, including a field name, a '=' and the field data. Example: `_SYSTEMD_UNIT=foobar.service`. The **hash** field is a hash value of the payload.

**next_hash_offset** is used to link up DATA objects in the DATA_HASH_TABLE if a hash collision happens (in a singly linked list, with an offset of 0 indicating the end). **next_field_offset** is used to link up data objects with the same field name from the FIELD object of the field used.

**entry_offset** is an offset to the first ENTRY object referring to this DATA object. **entry_array_offset** is an offset to an ENTRY_ARRAY object with offsets to other entries referencing this DATA object. Storing the offset to the first ENTRY object in-line is an optimization given that many DATA objects will be referenced from a single entry only (for example, `MESSAGE=` frequently includes a practically unique string). **n_entries** is a counter of the total number of ENTRY objects that reference this object, i.e. the number of entries in all ENTRY_ARRAY objects chained up from this object, plus 1.

The **payload[]** field contains the field name and data unencoded, unless OBJECT_COMPRESSED is set in the `ObjectHeader`, in which case the payload is compressed with XZ (LZMA).

## Field Objects

```c
_packed_ struct FieldObject {
        ObjectHeader object;
        le64_t hash;
        le64_t next_hash_offset;
        le64_t head_data_offset;
        uint8_t payload[];
};
```

Field objects are used to enumerate all possible values a certain field name can take in the entire journal file.

The **payload[]** array contains the actual field name, without '=' or any field value. Example: `_SYSTEMD_UNIT`. The **hash** field is a hash value of the payload. As for the DATA objects, this too is either the `.file_id` keyed siphash24 hash of the payload, or the non-keyed Jenkins hash.

**next_hash_offset** is used to link up FIELD objects in the FIELD_HASH_TABLE if a hash collision happens (in a singly linked list, with offset 0 indicating the end). **head_data_offset** points to the first DATA object that shares this field name. It is the head of a singly linked list using DATA's **next_field_offset** offset.

## Entry Objects

```c
_packed_ struct EntryItem {
        le64_t object_offset;
        le64_t hash;
};

_packed_ struct EntryObject {
        ObjectHeader object;
        le64_t seqnum;
        le64_t realtime;
        le64_t monotonic;
        sd_id128_t boot_id;
        le64_t xor_hash;
        EntryItem items[];
};
```

An ENTRY object binds several DATA objects together into one log entry, and includes other metadata such as various timestamps.

The **seqnum** field contains the sequence number of the entry, **realtime** the realtime timestamp, and **monotonic** the monotonic timestamp for the boot identified by **boot_id**.

The **xor_hash** field contains a binary XOR of the hashes of the payload of all DATA objects referenced by this ENTRY. This value is usable to check the contents of the entry, being independent of the order of the DATA objects in the array.

The **items[]** array contains references to all DATA objects of this entry, plus their respective hashes.

In the file, ENTRY objects are written ordered monotonically by sequence number. For contiguous parts of the file written during the same boot (i.e. with the same boot_id) the monotonic timestamps are monotonic too. Modulo wallclock time jumps (due to incorrect clocks being corrected) the realtime timestamps are monotonic too.

## Hash Table Objects

```c
_packed_ struct HashItem {
        le64_t head_hash_offset;
        le64_t tail_hash_offset;
};

_packed_ struct HashTableObject {
        ObjectHeader object;
        HashItem items[];
};
```

The structure of both DATA_HASH_TABLE and FIELD_HASH_TABLE objects is identical: they implement a simple hash table, with each cell containing offsets to the head and tail of the singly linked list of the DATA and FIELD objects, respectively. DATA's and FIELD's **next_hash_offset** fields are used to chain up the objects. Empty cells have both offsets set to 0.

Each file contains exactly one DATA_HASH_TABLE and one FIELD_HASH_TABLE object. Their payload is directly referred to by the file header in the **data_hash_table_offset**, **data_hash_table_size**, **field_hash_table_offset**, **field_hash_table_size** fields. These offsets do _not_ point to the object headers but directly to the payloads. When a new journal file is created the two hash table objects need to be created right away, as the first two objects in the stream.

If the hash table fill level rises above a certain limit (taking a cue from Java's Hashtable, for example: > 75%), the writer should rotate the file and create a new one.

The DATA_HASH_TABLE should be sized taking into account the maximum size the file is expected to grow to, as configured by the administrator or by disk space considerations. The FIELD_HASH_TABLE should be sized to a fixed size, as the number of fields should be fairly static: it depends only on developers' creativity rather than on runtime parameters.

## Entry Array Objects

```c
_packed_ struct EntryArrayObject {
        ObjectHeader object;
        le64_t next_entry_array_offset;
        le64_t items[];
};
```

Entry arrays are used to store a sorted array of offsets to entries. Entry arrays are strictly sorted by offsets on disk, and hence by their timestamps and sequence numbers (with some restrictions, see above).

Entry arrays are chained up. If one entry array is full another one is allocated and the **next_entry_array_offset** field of the old one is set to point to it. An entry array with **next_entry_array_offset** set to 0 is the last in the list. To optimize allocation and seeking, as entry arrays are appended to a chain of entry arrays they should increase in size (double).

Due to being monotonically ordered, entry arrays may be searched with a binary search (bisection).

One chain of entry arrays links up all entries written to the journal. The first entry array is referenced in the **entry_array_offset** field of the header.

Each DATA object also references an entry array chain listing all entries referencing a specific DATA object. Since many DATA objects are only referenced by a single ENTRY the first offset of the list is stored inside the DATA object itself; an ENTRY_ARRAY object is only needed if it is referenced by more than one ENTRY.

## Tag Object

```c
#define TAG_LENGTH (256/8)

_packed_ struct TagObject {
        ObjectHeader object;
        le64_t seqnum;
        le64_t epoch;
        uint8_t tag[TAG_LENGTH]; /* SHA-256 HMAC */
};
```

Tag objects are used to seal off the journal from alteration. At regular intervals a tag object is appended to the file. The tag object consists of a SHA-256 HMAC tag that is calculated from the objects stored in the file since the last tag was written, or from the beginning if no tag was written yet. The key for the HMAC is calculated via the externally maintained FSPRG logic for the epoch that is written into **epoch**. The sequence number **seqnum** is increased with each tag. When calculating the HMAC of an object, header fields that are volatile are excluded (skipped). More specifically, all fields that might validly be altered after an object has been written, in order to maintain a consistent file structure (such as offsets to objects added later for the purpose of linked lists and suchlike), are not protected by the tag. This means a verifier has to independently check these fields for consistency of structure. For the fields excluded from the HMAC please consult the source code directly. A verifier should read the file from the beginning to the end, always calculating the HMAC for the objects it reads. Each time a tag object is encountered the HMAC should be verified and restarted. The tag object sequence numbers need to increase strictly monotonically. Tag objects themselves are partially protected by the HMAC (i.e. **seqnum** and **epoch** are included, the tag itself is not).

## Algorithms

### Reading

Given an offset to an entry, all data fields are easily found by following the offsets in the data item array of the entry.

Listing entries without a filter is done by traversing the list of entry arrays starting with the header's **entry_array_offset** field.

Seeking to an entry by timestamp or sequence number (without any matches) is done via binary search in the entry arrays starting with the header's **entry_array_offset** field. Since these arrays double in size as more are added, the time cost of seeking is O(log(n)*log(n)) if n is the number of entries in the file.

When seeking or listing with one field match applied, the DATA object of the match is first identified, and then its data entry array chain traversed. The time cost is the same as for seeks/listings with no match.

If multiple matches are applied, multiple chains of entry arrays should be traversed in parallel. Since they all are strictly monotonically ordered by offset of the entries, advancing in one can be directly applied to the others, until an entry matching all matches is found. In the worst case seeking like this is O(n) where n is the number of matching entries of the "loosest" match, but in the common case it should be much more efficient, at least for the well-known fields, where the set of possible field values tends to be closely related. Checking whether an entry matches a number of matches is efficient since the item array of the entry contains hashes of all data fields referenced, and the number of data fields of an entry is generally small (< 30).

When interleaving multiple journal files seeking tends to be a frequently used
|
||||||
|
operation, but in this case can be effectively suppressed by caching results
|
||||||
|
from previous entries.

When listing all possible values a certain field can take, it is sufficient
to look up the FIELD object and follow the chain of links to all DATA objects
it includes.
### Writing

When an entry is appended to the journal, for each of its data fields the
data hash table should be checked. If the data field does not yet exist in
the file, it should be appended and added to the data hash table. When a
field data object is added, the field hash table should be checked for the
field name of the data field, and a field object added if necessary. After
all data fields (and recursively all field names) of the new entry are
appended and linked up in the hash tables, the entry object should be
appended and linked up too.
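
The deduplication in that append path can be modeled as follows. Plain linear
scans stand in for the on-disk hash tables, the entry object itself is elided,
and all names are illustrative assumptions, not the codebase's:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_OBJECTS 64 /* no overflow checks in this sketch */

static const char *data_objects[MAX_OBJECTS]; /* "FIELD=value" payloads */
static size_t n_data;

static const char *field_names[MAX_OBJECTS];  /* field name objects */
static size_t field_lens[MAX_OBJECTS];
static size_t n_fields;

static int data_exists(const char *payload) {
    for (size_t i = 0; i < n_data; i++)
        if (strcmp(data_objects[i], payload) == 0)
            return 1;
    return 0;
}

static int field_exists(const char *name, size_t len) {
    for (size_t i = 0; i < n_fields; i++)
        if (field_lens[i] == len && memcmp(field_names[i], name, len) == 0)
            return 1;
    return 0;
}

/* Append one entry: for each "FIELD=value" data field, add a data
 * object if it is new; for each new data object, add its field name
 * object if that is new too. */
static void append_entry(const char *const *fields, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (data_exists(fields[i]))
            continue; /* data object already in the "hash table" */

        data_objects[n_data++] = fields[i];

        size_t name_len = strcspn(fields[i], "="); /* name before '=' */
        if (!field_exists(fields[i], name_len)) {
            field_names[n_fields] = fields[i];
            field_lens[n_fields] = name_len;
            n_fields++;
        }
    }
    /* ...here the entry object itself would be appended and linked up */
}
```

Appending two entries that share a data field (say, the same `PRIORITY=6`)
creates the data object and its field name object only once.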

At regular intervals a tag object should be written if sealing is enabled
(see above). Before the file is closed a tag should be written too, to seal
it off.

Before writing an object, time and disk space limits should be checked and
rotation triggered if necessary.
## Optimizing Disk IO
_A few general ideas to keep in mind:_

The hash tables for looking up fields and data should quickly end up in the
memory cache and not hurt performance. All entries and entry arrays are
ordered strictly by time on disk, and hence should expose an OK access
pattern on rotating media, when read sequentially (which should be the most
common case, given the nature of log data).

The disk access patterns of the binary search for entries needed for seeking
are problematic on rotating disks. This should not be a major issue though,
since seeking should not be a frequent operation.

When reading, collecting data fields for presenting entries to the user is
problematic on rotating disks. In order to optimize these patterns the item
array of entry objects should be sorted by disk offset before
writing. Effectively, frequently used data objects should be in the memory
cache quickly. Non-frequently used data objects are likely to be located
between the previous and current entry when reading and hence should expose
an OK access pattern. Problematic are data objects that are neither
frequently nor infrequently referenced, which will cost seek time.
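
Sorting the item array by disk offset before writing is a plain sort on the
offset member. The `EntryItem` shape below is an illustrative stand-in for
the entry's item array slots, not the exact on-disk layout:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* One slot of an entry's item array: the offset of the referenced data
 * object plus that object's hash (illustrative layout). */
typedef struct {
    uint64_t object_offset;
    uint64_t hash;
} EntryItem;

static int by_offset(const void *x, const void *y) {
    uint64_t a = ((const EntryItem *)x)->object_offset;
    uint64_t b = ((const EntryItem *)y)->object_offset;
    return a < b ? -1 : a > b ? 1 : 0; /* avoid overflow of a - b */
}

/* Sort the item array by disk offset before writing the entry, so that
 * a sequential reader visits the referenced data objects in disk order. */
static void sort_items_by_offset(EntryItem *items, size_t n) {
    qsort(items, n, sizeof *items, by_offset);
}
```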
And that's all there is to it.
Thanks for your interest!