This article was originally posted on LWN.net by Luca Boccassi. Read the original article on LWN.net.
Following up from last year's first Image-Based Linux Summit, a second meeting was held in Berlin on September 12th, 2023, the day before All Systems Go! 2023, at the Microsoft office. The goal of these summits is to find common ground among stakeholders from various engineering groups around the topic of image-based Linux distributions, communicate progress, and attempt to build a strategy to tackle shared problems together. The organizers — Luca Boccassi, Lennart Poettering, and Christian Brauner — welcomed participants from the UAPI Group, which draws developers from a long list of companies with an interest in this area, and spent the full day discussing a variety of topics. Full minutes have been published on the UAPI Group’s web site.
Progress since last year
Progress achieved since the last summit was discussed first. The UAPI Group has been set up, with a GitHub organization and a new web site that is already gathering specifications relevant to image-based Linux; these include those for unified kernel images (UKIs) and discoverable disk images (DDIs). More specifications are being worked on, including a specification to formalize how to handle configuration files on a hermetic /usr system — the drop-ins, masking, and /etc/ -> /run/ -> /usr/ patterns already familiar to users of systemd and programs built on libeconf.
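The lookup pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the text of the specification being drafted: the function names are hypothetical, the drop-in directory is assumed to be named after the main file with a ".d" suffix, drop-ins are assumed to end in ".conf", and a mask is assumed to be either a symlink to /dev/null or an empty file, as systemd users will recognize.

```python
import os
from pathlib import Path


def is_mask(path):
    """Assumed masking rule: a symlink to /dev/null or an empty regular file."""
    if path.is_symlink() and os.path.realpath(path) == "/dev/null":
        return True
    return path.is_file() and path.stat().st_size == 0


def resolve_config(name, roots):
    """Resolve `name` across `roots`, listed from highest to lowest priority.

    Returns (main_file_or_None, ordered_drop_ins).
    """
    # The first root that carries the main file wins; a mask hides it entirely.
    main = None
    for root in roots:
        candidate = Path(root) / name
        if candidate.is_symlink() or candidate.exists():
            main = None if is_mask(candidate) else candidate
            break

    # Drop-ins are merged by file name: walk from lowest to highest priority
    # so that a same-named file in a higher-priority root replaces the lower one.
    drop_ins = {}
    for root in reversed(roots):
        directory = Path(root) / (name + ".d")
        if directory.is_dir():
            for entry in directory.glob("*.conf"):
                drop_ins[entry.name] = entry

    # Apply the surviving drop-ins in file-name order, skipping masked ones.
    ordered = [path for _, path in sorted(drop_ins.items()) if not is_mask(path)]
    return main, ordered
```

Calling resolve_config("foo.conf", ["/etc", "/run", "/usr/lib"]) would then prefer a local override in /etc, fall back to runtime configuration in /run, and finally use the vendor default shipped under /usr; the directory list here is an example, not a mandated search path.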
The systemd project has implemented a lot of changes, many of which were initially suggested at last year’s summit. Systemd-boot and systemd-stub gained several new features, including add-ons support (signed PE binaries for kernel command-line additions). UKIs can now be built with a new Python tool, ukify, that doesn't depend on objcopy and, thus, supports cross-architecture assembly of images; these can include many new metadata fields, such as signed PCR11 measurements from the TPM.
Several components of the machine will now be measured by systemd so that secrets can be tied not just to a UKI vendor, but also to specific system information such as a disk encryption key or disk UUID, and also to a specific phase of the boot process. Systemd-repart and mkosi can now build images without privileges or loop devices. They can also be used to build initrds fully based on packages, with no dracut/initramfs-tools involvement (this used to be implemented by a separate mkosi-initrd project, but is now supported by mkosi itself).
On the provisioning side, SMBIOS is now supported to provide read-only, ephemeral configuration to a virtual machine; this data can include systemd credentials, which are now supported by most systemd components, including generators. The new "confext" type DDIs are also supported for dm-verity-protected images that deliver configuration data to be overlaid upon /etc. Sealing secrets against the TPM can now be done "offline" (from a different machine), having only the target's public key. Fully encrypted TPM sessions are used, creating and pinning the storage root key (SRK) if not already present. Last but not least, a new soft-reboot mechanism was added that only reboots user space, leaving the kernel running, which is useful in image-based systems for updating from one image version to the next with minimal latency or loss of connectivity and state.
Distributors have also done their fair share of work over the last year:
- NixOS is working on a Rust version of systemd-stub and ukify to boot and build UKIs; systemd-repart and systemd-sysupdate are available. NixOS now offers systemd-networkd in its initrd.
- Flatcar now uses systemd-sysext extensively for A/B updates of OEM software and provides a set of scripts to build third-party software and deploy it with systemd-sysupdate. Flatcar implements factory-reset functionality, and /etc is now an overlay on top of a read-only base.
- Ubuntu recently made the tech news with the announcement of a desktop flavor that enables TPM-backed disk encryption by default, which is one of the main goals for all of the summit’s attendees. Ubuntu uses systemd-stub for the Snap-based kernel updates, and to pre-calculate PCRs for TPM sealing.
- GNOME OS already supported systemd-boot, and now also uses systemd-repart for partitioning on first boot. GNOME OS deploys UKIs with an initrd built using dracut.
- SUSE is steadily working toward using systemd-boot in various OS flavors, including MicroOS, and the YaST installer was enhanced to support this. Aeon, formerly MicroOS Desktop, is a new image-based flavor that uses systemd-boot, image-based updates, and full-disk encryption by default, and is exploring adding support for systemd-homed for managing users and home directories. Tumbleweed has fully embraced hermetic /usr and no longer ships files under /boot; work is in progress toward moving the default configuration files from /etc to /usr.
- The Fedora installer, Anaconda, gained native support for installing systems using systemd-boot as an alternative to GRUB.
Finally, UKI support is spreading to other projects; patches were proposed to allow loading them from GRUB, and OSBuild has gained native support for building them.
As expected, most of the focus in the past year has been on improving the situation around boot security. Linux has long been left behind by Windows and macOS in this area, and it is refreshing to see such a renewed and concerted effort to close this embarrassing gap.
Hermetic /usr and sysext/confext
The discussion around sysext and confext has been gaining traction recently. These are two types of extension images, or DDIs, that provide read-only additions to a root or base filesystem, extending the /usr and /etc hierarchies respectively. Currently, the sysext/confext overlay is read-only, prompting the question of whether an optional writable layer or mode should be added, though not set as the default. This mode could be either ephemeral or persistent. Additionally, there's a proposal to move the OS layer to the top of the stack; it currently resides at the bottom. Suggestions have been made to address these issues using symlinks, an approach that is currently being worked on, and there's an idea to introduce an ordering guarantee to the sysext specification, which was implemented shortly after the summit.
Configuration management for image-based systems
There are two sides to this discussion. First of all, there is the question of how to get a configuration into a virtual machine. A common mechanism is provided by cloud-init and a faux-network connection. This is far from optimal, as requiring a full network handshake is slow, cumbersome, and fraught with vendor-specific pitfalls. An idea to improve the general experience around this flow would be to use a network namespace plus virtual routing and forwarding (VRF) to let tools like cloud-init have their own private network connection to the local hypervisor/cloud fabric without affecting the rest of the system.
A better alternative would be to use something that is not network-based at all. Systemd gained support for SMBIOS Type 11 objects, which are already supported by QEMU and Cloud Hypervisor. These objects work well for a user’s virtual machine, but they are problematic for some cloud vendors to support as the SMBIOS strings need to be fixed some time before the required configuration data is available. A proposed alternative would be a new ACPI driver and pseudo-device provided by the firmware or hypervisor that generates the data on-the-fly when requested, in a blocking mode. Systemd would provide a synchronization point in the initrd that services can hook into and synthesize systemd credentials; then systemd would reload the configuration, proceed with the transition, and exit the initrd phase.
This would essentially amount to adding a third phase to the boot process, in the initrd, when additional resources become available as a consequence of the first phase doing the required actions, before transitioning to the rootfs. The latter part would be relatively easy to implement, with the ACPI driver being the most difficult piece of work. If a cloud vendor volunteers to do this work, then it could be easily integrated.
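For the SMBIOS Type 11 route that already exists, a small helper can show what passing credentials to a QEMU guest might look like. This is a sketch, not an official tool: the function name is hypothetical, the name check is a deliberately conservative assumption, and it relies on the "io.systemd.credential.binary:" string prefix documented by systemd, which takes a Base64-encoded value and so avoids any quoting problems with commas on the QEMU command line.

```python
import base64
import string

# Conservative assumption: only allow characters that need no escaping anywhere.
_ALLOWED = set(string.ascii_letters + string.digits + "._-")


def smbios_credential_args(credentials):
    """Build QEMU arguments that pass each credential as an SMBIOS Type 11 string.

    `credentials` maps a credential name (str) to its value (bytes).
    """
    args = []
    for name, value in credentials.items():
        if not name or not set(name) <= _ALLOWED:
            raise ValueError(f"unsupported credential name: {name!r}")
        encoded = base64.b64encode(value).decode("ascii")
        args += [
            "-smbios",
            f"type=11,value=io.systemd.credential.binary:{name}={encoded}",
        ]
    return args
```

The returned list can be appended to an existing QEMU argument vector. The limitation discussed above still applies: the strings must be known before the virtual machine starts, which is exactly what some cloud vendors find hard to guarantee.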
The second sub-topic concerns how configuration files are consumed by services; SUSE has been actively working on adding support for libeconf in upstream projects for many years. While progress has been made, certain applications, like apache2 and nginx, still rely on files in /etc to function properly. Complex configuration files, often in XML format, have also posed challenges. Fedora has introduced patches to address these issues, demonstrating the ongoing efforts to achieve hermetic /usr.
The main action item is to create a tracking issue to list upstream projects that still have to be updated to support this configuration model, so that contributors can collaborate to shrink the list. While the work is not finished, the situation on this topic is in a much better place than it was a few years ago, thanks to the work of many stakeholders. Finally, a specification detailing how libeconf and systemd handle configuration files, aimed at developers implementing their own configuration loading, was recently published.
Systemd credentials
Still on the topic of configuration, the question was raised on how to update systemd credentials. At the moment, they are static; a service receives its credentials at startup time, and they cannot change until the service itself is restarted. For a lot of use cases this is enough, but for some it is not; for example, certificates might be rotated for a service that is sensitive to interruptions. Normally the recommended pattern is to use the file descriptor store to achieve fast restarts (one issue raised was the lack of documentation around this systemd feature, which was promptly fixed just after the summit as a consequence), but in some cases the service interruption is too expensive to contemplate for such a configuration update.
Fortunately, work is already scheduled to integrate the confext feature into the "reload" mechanism, which is traditionally used to send a SIGHUP to a service, to also reload the stack of confexts in case some were updated. The same pattern could be used for credentials; on reload, credentials are opened again and updated, so that interrupt-free updates can be performed.
Another issue was raised: currently, the documentation states that the path to loaded credentials has to be derived from an environment variable, which is problematic for projects that do not support environment variables or search paths. But it turns out this was set up this way only because of user units, which depend on the user ID; for system services it is actually fixed. The documentation will be updated to clarify this, hopefully removing a (small) barrier to adoption.
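From a service's point of view, consuming credentials and picking up new values on reload could look roughly like the following sketch. The class and function names are hypothetical; the only systemd interface relied upon is the CREDENTIALS_DIRECTORY environment variable, with an explicit directory argument for programs that cannot use it. Whether the directory contents actually change on reload depends on the proposal described above being implemented, so that part is an assumption.

```python
import os
import signal
from pathlib import Path


class Credentials:
    """In-memory copy of the credentials passed to this service."""

    def __init__(self, directory=None):
        # Prefer an explicit path; otherwise use the variable set by systemd.
        directory = directory or os.environ.get("CREDENTIALS_DIRECTORY")
        if directory is None:
            raise RuntimeError("no credentials directory available")
        self.directory = Path(directory)
        self.values = {}
        self.reload()

    def reload(self):
        """Re-read every credential file, replacing the cached values."""
        self.values = {
            path.name: path.read_bytes()
            for path in self.directory.iterdir()
            if path.is_file()
        }

    def names(self):
        """Enumerate the credentials currently available."""
        return sorted(self.values)

    def get(self, name):
        return self.values[name]


def install_reload_handler(credentials):
    """Re-read credentials on SIGHUP, the traditional "reload" signal."""
    signal.signal(signal.SIGHUP, lambda signum, frame: credentials.reload())
```

A long-running service would create one Credentials object at startup, call install_reload_handler() on it, and fetch values through get() each time they are needed, so that a rotated certificate is picked up without a restart.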
Separately, it was also mentioned that there is currently no way to enumerate existing credentials; the proposal is to enhance the systemd-creds tool to do this job as well. Another future improvement that is already being worked on is asymmetric TPM-based encryption, so that credentials can be encrypted, away from the host, using only the target’s TPM’s public key. Currently only symmetric encryption is supported, which can be tedious to use as it requires key sharing.
/efi vs /boot vs /boot/efi vs /run/efi
The debate over the mount point for the EFI system partition (ESP) is ongoing. The issue is that when both a Linux extended boot (XBOOTLDR) partition and an ESP are present, it is unclear where each should be mounted by default. Generally speaking, the ESP is where you always want to store bootloaders, and XBOOTLDR is where you want to store kernels (and initrds), as they will likely require much more disk space. SUSE's RPM-based filesystem creates base directories, which can be problematic for top-level directories serving as mount points. A suggestion to use automount was discussed, questioning the necessity of manually mounting these directories.
Different tools, including bootctl, fwupd, kernel-install, and systemd-logind, require access to these locations for various purposes. The challenge is to ensure that these tools don't double-mount directories. There's a proposal to establish consistent standards and APIs for handling these paths, along with discussions about default directory locations and conflicts with other specifications. The first order of business was to reconcile the discoverable partitions specification and the bootloader specification so that they suggest the same approach; that was fixed shortly after the summit. Whether bootctl could be used to provide a unified interface to access the EFI partition will also be explored.
TPM
Work in the area of measured boot was one of the focal points of last year’s summit, and this year was no different. After a brief recap of all the work that has been done to implement sealing against signed policies, so that secrets can be entrusted to be decrypted only when booting images from the same vendor, attention focused on upcoming developments that will also allow sealing against the state of the local machine.
The upcoming systemd-pcrlock tool will allow sealing against a policy that takes into account PCRs zero to seven, which are owned by the firmware, but that policy can also be temporarily relaxed (if the system is in a known good state) when a firmware update is applied. Such updates can optionally provide a list of expected measurements (in the TCG CEL-JSON format) that will be used for the new policy. If, instead, those measurements are not provided, the next boot process will remeasure and add the new state to the policy, making it strict again. This ensures that attackers cannot simply relax the policy when booting their own systems, as the policy can be changed only when the system is in a known good state. If vendors collaborate by providing the measurements file, then security is never downgraded, not even temporarily. This feature represents a substantial step forward from the status quo, which requires re-sealing secrets on every update, thus changing a disk’s superblock, and making it essentially unfeasible to seal many objects (e.g. systemd credentials).
In order to develop this feature, an append-only event log had to be added to systemd, as measurements need to be replayed. This event log follows the TCG event log specification closely, so that it can be translated to or from that format. It was discussed whether to provide an API for it so that applications can also consume or append to the event log; the idea was deemed acceptable, if someone was willing to implement it.
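Replaying an event log is what makes it possible to predict PCR values. Each measurement "extends" a PCR by hashing the previous value together with the new digest. The sketch below uses a simplified, hypothetical event representation (pairs of PCR index and SHA-256 digest) rather than the real TCG log format, and assumes every PCR starts out as all zeros.

```python
import hashlib

PCR_SIZE = hashlib.sha256().digest_size  # 32 bytes for the SHA-256 bank


def extend(pcr_value, digest):
    """TPM extend operation: new = SHA256(old || digest)."""
    return hashlib.sha256(pcr_value + digest).digest()


def replay(events):
    """Replay (pcr_index, digest) events in order and return the final PCR values.

    Only PCRs that appear in the log are present in the returned dict.
    """
    pcrs = {}
    for index, digest in events:
        if len(digest) != PCR_SIZE:
            raise ValueError("digest does not match the SHA-256 bank size")
        current = pcrs.get(index, bytes(PCR_SIZE))  # assumed initial value: zeros
        pcrs[index] = extend(current, digest)
    return pcrs
```

Because extend is not commutative, the same events in a different order yield a different result, which is why the log has to be append-only and replayed exactly as recorded.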
There are a few corner cases that still need some work — for kexec, it is currently unclear what the best course of action would be. Current thinking is to change the policy to measure a nonce and expect it to be provided by the new kernel, so that only the next kernel can be successfully validated. Also, on factory reset, several machine-specific identifiers that are measured would be lost, so a solution is needed. Ubuntu Core stores an encrypted object on installation that allows such a reset, and systemd should be enhanced to provide the same capability.
Finally, non-disk-encryption usage of the TPM was discussed in order to gauge interest. The systemd credentials use case was already mentioned, but there are others, chiefly remote attestation. System Transparency provides OSS tools for it, and so does Keylime, which is integrated into SUSE MicroOS.
Unified kernels and pre-built initrd
The discussion about unified kernels and pre-built initrd was brief, as most of the work has already been done and embraced by participants. The main news is around the add-ons feature, which allows the platform owner’s optional enhancements to be added on top of the OS vendor’s images. This supports kernel command-line extensions for now; support for DTB is under review, and initrd add-ons are next on the to-do list. Finally, new sections will likely be documented in the UKI specification; these include support for embedded microcode, so that it can be loaded first in the on-the-fly generated initrd that is passed to the kernel, as that’s the established protocol.
systemd-sysusers and "user" users
The systemd-sysusers and "user" users discussion focuses on the addition of a switch to copy /etc/skel so that it can be used for "normal" users too, and not just "system" users. This would be a lightweight integration, focusing exclusively on support for the home directory and /etc/skel.
Homed in openSUSE Aeon
The final topic discussed was systemd-homed and its attempted integration into openSUSE Aeon. This effort suffered from a number of paper cuts, but it seems that most are solved or are being solved.
The first issue is provisioning. Since partitions have to be sized accurately according to the number of users, this needs to be known ahead of time. There are two solutions to this problem. First, with Btrfs subvolumes the problem goes away entirely: there is no need to resize partitions, since space is allocated dynamically as needed. However, native encryption of subvolumes is not supported by Btrfs yet, although it is being worked on. Second, there is ongoing work in GNOME to provide an interface to interactively resize partitions when needed.
The second issue is integration into the desktop GUI, which is currently lacking but, once again, GNOME is working to implement it so that homed user management is fully integrated into GNOME account management. Furthermore, the lack of an upstream SELinux policy for homed was another issue that was discussed, but work is ongoing in Fedora to add support for it.
Finally, how to properly size /home relative to the root filesystem was discussed. Android uses dm-linear to create live partition "extensions" without needing to reallocate data; systemd-repart and homed could be enhanced to use the same pattern, so that space could be cheaply reassigned between the two partitions. An alternative approach could be to only have one partition and rely on Opal self-encrypting drives to protect the contents of the root directory.
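To make the dm-linear idea concrete, the sketch below builds a device-mapper table that stitches several physical extents into one logical block device. It illustrates the table format only ("start length linear device offset", all in 512-byte sectors); it is not how Android or systemd implement it, and the function name and device paths are hypothetical.

```python
def linear_table(extents):
    """Build a dm-linear table from (device, start_sector, num_sectors) extents.

    Extents are laid out back to back in the order given, so growing the
    logical device only requires appending a new extent.
    """
    lines = []
    logical_start = 0
    for device, start_sector, num_sectors in extents:
        if num_sectors <= 0:
            raise ValueError("extent length must be positive")
        lines.append(f"{logical_start} {num_sectors} linear {device} {start_sector}")
        logical_start += num_sectors
    return "\n".join(lines)
```

A table produced this way is the kind of input that dmsetup accepts when creating or reloading a mapping. Reassigning space from one filesystem to another would then mean shrinking one mapping and appending the freed extent to the other, without moving any data.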
Conclusions
After a long day of discussions, participants were tired but happy. The summit was again positive and productive; lots of good ideas and action items came out of it. And now we have about a year for the hard part: actually implementing them. Keep an eye on our changelogs for further updates.