Building Trust into OS images for Confidential Containers

henryyan · ‎Mar 25 2024

This article was originally posted on confidentialcontainers.org by Magnus Kulke. Read the original article on confidentialcontainers.org. The source for this content can be found here.

Containers and OS Images

Confidential Containers using Kata-Containers are launched in a Confidential Virtual Machine (CVM). Those CVMs require a minimal Linux system which will run in our Trusted Execution Environment (TEE) and host the agent side of Kata-Containers (including various auxiliary attestation tools) to launch containers and facilitate secure key releases for a confidential Pod. Integrity of the workload is one of the key pillars for Confidential Computing. Consequently, this implies we also must trust the infrastructure components that host containers on a confidential guest VM, specifically: firmware, rootfs, kernel and kernel cmdline.

For a TEE there are various options to establish this kind of trust. Which option will be used depends on the capabilities and specifics of a TEE. All of them will include various degrees of “measurements” (that is: cryptographic hashes for a blob of data and/or code) to reach the same goal: providing a verifiable statement about the integrity of the OS image. We’ll discuss three viable options; those are not exhaustive

Initial ramdisk

We can opt to not use a rootfs and bundle the required userland components into Linux’ initial ramdisk (initrd), which is loaded by the kernel. Outside a CoCo scenario this facility is used to provide a boot stage in which kernel drivers can be loaded on-demand from a memory-backed (compressed) volume, not having to bundle device drivers for various hardware statically in each vendor kernel. For CoCo VMs, this kind of flexibility is not really required: we do know beforehand the virtualized hardware that our CVM is configured with, and it will require only a limited set of drivers. Due to its static nature, relying solely on an initrd would be impractical for many workloads. For CoCo however, this is a viable option, since the dynamic aspect of its workload is mostly deferred to the container execution. This means we can have the kernel launch a kata-agent as PID 1 directly from an initrd.

This option is appealing for certain CoCo deployments. If we have a Trusted Execution Environment (TEE) that will produce a launch-measurement of the initial RAM state of a CVM, we can use this measurement to gain confidence that our os image is genuine. We can calculate the expected value of a given launch measurement offline and then verify during remote attestation that the actual launch measurement matches our expected value.

Calculating a launch measurement

An expected SEV-SNP launch measurement for Linux direct boot with Qemu can be calculated using trusted artifacts (firmware, kernel & initrd) and a few platform parameters. Please note that the respective kernel/fw components and tools are still being actively developed. The AMDESE/AMDSEV repository provides instructions and pointers to a working set of revisions.

DM Verity

The infrastructure components might outgrow a reasonably sized initrd. We want to limit the initrd to a smaller size, to not spend too much of the CVM’s RAM, but leave as much as possible for the container payload. In the worst case we’ll have to spend ~2x the size of the initrd (we also need to keep the compressed initrd in RAM). With an increasing amount of supported TEEs the infrastructure components will inevitably grow, since they need to support various attestation schemes and hardware. Some of those schemes might also have a larger dependency tree, which is unreasonable to statically link or bundle into an initrd. There might be compliance requirements which mandate the use of certain cryptographic libraries that must not be statically compiled. Those considerations might nudge us to a more traditional Linux setup of kernel, initrd and rootfs.

A rootfs can comfortably host the infrastructure components and we can still package support for all kinds of TEE in a single OS image artifact. However, the dependencies for a given TEE can now be loaded dynamically into RAM. For CVMs there is a restriction when it comes to how an OS image is handled: We must prevent the CVM host, storage provider or anyone else outside the TEE from compromising the image to uphold the integrity of a CoCo workload. dm-verity is a kernel technology that prevents changes to a read-only filesystem during runtime. Block-level access is guarded by a hash tree and on unexpected block data the kernel will panic. This protection scheme requires a root hash that needs to be provided when the verity-protected filesystem is mounted. We can provide this root hash through the kernel cmdline or bake it into the initrd. In any case, the TEE has to include the root hash in a launch measurement to provide verifiable integrity guarantees.

Creating a verity volume

DM-Verity volumes feature a hash tree and a root hash in addition to the actual data. The hash tree can be stored on disk next to the verity volume or as a local file. We’ll store the hash-tree as file for brevity and write a string CoCo into a file /coco on the formatted volume:

Corrupting the image

Now we toggle a bit on the raw image (CoCo => DoCo in /coco). If the image is attached as a block device via dm-verity, there will be IO errors and respective entries in the kernel log, once we attempt to read the file.

vTPM

There are setups in which a launch measurement of the TEEs will not cover the kernel and/or initrd. An example of such a TEE is Azure’s Confidential VM offering (L1 VMs provided by hypervisor running on a physical host). Those CVMs can host Confidential Containers in a CoCo Peerpod setup. The hardware evidence, which is attesting encrypted RAM and CPU registers is exclusively fetched during an early boot phase. Only in later stages the kernel and initrd are loaded from an OS image and hence the launch measurement will not cover the CoCo infrastructure components yet. To still be able to provide an integrity guarantee such a TEE can defer measurements of the later boot stages to a virtual TPM device (vTPM).

To isolate it from the host a confidential vTPM is provisioned within the TEE during early boot and cryptographically linked to the TEE’s hardware evidence. To further secure secrets like private keys from the guest OS, the provisioning is performed at a certain privilege level preventing direct access and manipulation by the guest OS which is running at a lesser privilege level.

TPM is a mature technology, deployed in a lot of hardware to protect operating systems and workloads from being compromised. It’s seeing increased adoption and support in the Linux kernel and userland. A TPM device has multiple Platform Configuration Registers (PCR). Those can hold measurements and they can be “extended” with additional measurements in a one-way function to create a comprehensive, replayable log of events that occur during the boot process. “Measured Boot” is a procedure in which each boot step measures the subsequent step into a specific PCR. As a whole this represents a verifiable state of the system, much like an initial launch measurement, however with more granularity.

Image building

Modern OS Image build tools for Linux like systemd’s mkosi make it trivial to build OS images with dm-verity protection enabled, along with Unified Kernel Images (UKI) which bundles kernel, initrd and kernel cmdline into conveniently measurable artifacts. A modern distribution packaging recent systemd (v253+) revisions like Fedora (38+) will perform the required TPM measurements.

Creating reference values

To retrieve the expected measurements, for a dm-verity protected OS image, we can boot the resulting image in a trusted environment locally. The swtpm project is a great option to provide the virtual machine with a vTPM.

We retrieve VM firmware from debian's repository and attach the vTPM socket as character device:

Comparing PCRs

Once logged into the VM we can retrieve the relevant measurements in the form of PCRs (the package tpm2_tools needs to be available):

If we boot the same image on a Confidential VM in Azure’s cloud, we’ll see different measurements. This is expected since the early boot stack does not match our reference setup:

We can identify the common PCRs between the measurements in a cloud VM and those that we gathered in our reference setup. Those are good candidates to include them as reference values in a relying party against which a TEE’s evidence can be verified.

The UAPI Group’s TPM PCR Registry for Linux and systemd specifies PCR11 as a container for UKI measurements, covering kernel, initrd and kernel cmdline. Further registers that might be worth considering would be PCR4 (shim + UKI) or PCR7 (Secure Boot state).

Conclusion and outlook

We have looked at three different ways of building trust into OS host images for Confidential Containers. The intention was to illustrate how a chain of trust can be established using concrete examples and tools. The scenarios and technologies haven’t been covered comprehensively, each of those would be worth their own in-depth article.

Finally we have so far only covered the (mostly static) steps and components that provide a sandbox for confidential containers. Asserting integrity for containers themselves is a unique challenge for CoCo. There are a lot of dynamic aspects to consider in a realistic container deployment. Future articles might provide insights into how this can be achieved.

Products (50)

Special Topics (27)

Video Hub (462)

Most Active Hubs

Most Active Hubs

Video Hub

Building Trust into OS images for Confidential Containers