Several years ago, someone wrote that the world needs more blogs. I finally did something that I felt deserved a blog post, so here it is!

This post follows my journey to understand a strange bug I found while researching full-disk encryption. I was working on some other tasks when I noticed my server improperly decrypting itself and figured that it was worth taking a look. I ended up rediscovering a bug that had been patched over 2 years ago, but my path to understanding the bug passed through a bunch of interesting Linux topics, and I ended up with some good takeaways too.

This post is fairly technical, but I’ll try to explain all of the important concepts as I go.

Background

I’ve recently become interested in full-disk encryption* setups for devices that aren’t directly used by people, and so can’t have a password typed in when needed. There’s a clever approach using a Trusted Platform Module (TPM), which is a dedicated processor that can securely measure a system’s state and encrypt/decrypt data using keys that can’t be accessed directly. For disk encryption, the TPM measures the state of the system in a known-good configuration and the disk encryption key is then loaded into and encrypted by the TPM, referred to as sealing. After that point, the disk encryption key can only be released from the TPM if the measurements still match. These measurements are implemented as comparisons against a series of registers called Platform Configuration Registers (PCRs), which each measure different properties of the system (BIOS image, kernel image, secure boot configuration, etc.). The PCRs are part of the TPM and can’t be easily modified by an attacker, so the measurements are secure.
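
To make the "can’t be easily modified" property concrete, here is a minimal Python sketch of the extend operation that updates a PCR. A PCR is never written directly: each new measurement is hashed together with the previous value, so the final value depends on every event and on their order. The event strings below are made up for illustration, and a real TPM does this inside the chip rather than in software.

```python
import hashlib

# PCRs 0-15 reset to all zeroes at boot (32 bytes for the SHA-256 bank).
INITIAL = bytes(32)

def extend(pcr: bytes, event: bytes) -> bytes:
    """Return the new PCR value after measuring `event`."""
    event_digest = hashlib.sha256(event).digest()
    # The old value is folded into the new one, so it cannot simply be overwritten.
    return hashlib.sha256(pcr + event_digest).digest()

def measure(events) -> bytes:
    """Replay a sequence of measurements starting from a freshly reset PCR."""
    pcr = INITIAL
    for event in events:
        pcr = extend(pcr, event)
    return pcr

# Hypothetical boot events: only the secure boot setting differs.
good = measure([b"firmware-v1", b"secure-boot: on"])
tampered = measure([b"firmware-v1", b"secure-boot: off"])
print(good.hex())
print(tampered.hex())
```

The two printed values differ, and there is no sequence of further extends that an attacker can easily find to steer the tampered value back to the good one.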

It’s a nice idea. As an attacker, you can boot the system fully, but then it’s booting into a secure configuration that you can’t log in to. You can’t change that configuration because all the system data is encrypted, and if you try to modify the early boot code to print the disk key or add some other vulnerability, the TPM will measure the difference and refuse to provide the disk key, so the system doesn’t even boot up. No matter where you try to change data on the system, it will be detected and prevented.

It’s not this simple in practice, since a bunch of little details can let an attacker get around this setup, but in principle it’s an elegant solution to a relatively common problem. There are also a couple of tools that set this system up automatically: systemd-cryptenroll and clevis. This adventure concerns the former, which had a disturbingly simple failure mode that would completely bypass the security of this system.

[*] Technically not full disk: the EFI partition is unencrypted, as are the kernel image and initramfs.

Stumbling across a bug

systemd-cryptenroll is supposed to make it easy to decrypt an encrypted volume using a hardware device like a Yubikey or a TPM, which is what I’m using. It handles generating a new disk encryption key, sealing that key to the TPM, and writing the metadata to the encrypted volume so it knows how to use the TPM in the future. By default, `systemd-cryptenroll` only locks the key against the secure boot configuration, stored in PCR 7. After running cryptenroll, all the user needs to do is modify the /etc/crypttab file so that the system knows that it will need to decrypt the drive when booting up and then update the initramfs so that the decryption logic can actually exist before the main filesystem is decrypted.

Initramfs is a special file system that’s loaded by the kernel before literally everything else. It’s there because you might need special scripts to mount other file systems. This is actually a textbook example: we need some extra code to talk to the TPM and then decrypt the main partition with the key it gives us. These details aren’t that important for the bug I ran into (but I didn’t know that at the time, so I learned about them anyway).

When setting up cryptenroll on Ubuntu Server 22.04, I immediately ran into issues. I installed the dependencies, ran the command to enroll the TPM, systemd-cryptenroll --tpm2-device=auto /dev/sda3, and updated /etc/crypttab from

dm_crypt-0 UUID=e8ce17e6-d4c6-4af9-82b1-88d78f7ed921 none luks

to

dm_crypt-0 UUID=e8ce17e6-d4c6-4af9-82b1-88d78f7ed921 none tpm2-device=auto

Note the change in the last field; this should have set it up to use the TPM. But trying to finalize the changes with update-initramfs -u resulted in an error:

cryptsetup: WARNING: sda3_crypt: ignoring unknown option 'tpm2-device'

After some searching, I found a GitHub repo that solved the issue. I ran the script, and update-initramfs worked as expected. Upon rebooting, I was no longer prompted for a password and the server decrypted the boot drive automatically. It looked like a success. As a final test, I enabled secure boot, which should alter the state of PCR 7 (the PCR being measured by default). I rebooted and it still decrypted!

Wait, what?

Confusion

by Yosha

Here’s how the system should be working:

  1. The TPM measures the secure boot state and puts it in PCR 7.
  2. systemd-cryptenroll generates a new disk encryption key.
  3. systemd-cryptenroll seals the key to the TPM with a policy specifying to only unseal it when PCR 7 matches the current measurement.
  4. When booting, the system attempts to unseal the disk key. If secure boot hasn’t changed, it succeeds and continues booting normally.
    • In this case, secure boot did change, so PCR 7 should be different.
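
The expected flow above can be sketched as a toy model in Python. This is not how systemd or a real TPM is implemented: the `ToyTPM` class, its method names, and the way the policy digest is computed are all simplifications invented for illustration. It only shows the intended behaviour, where the sealed key is released while PCR 7 is unchanged and refused once it differs.

```python
import hashlib
import secrets

class ToyTPM:
    """Toy model: holds one PCR and releases a secret only if the PCR still matches."""

    def __init__(self, pcr7: bytes):
        self.pcr7 = pcr7
        self._sealed = None  # (policy digest at seal time, secret)

    def _policy(self) -> bytes:
        # Simplified stand-in for a PCR policy: a digest over the selected PCR value.
        return hashlib.sha256(b"PolicyPCR:7" + self.pcr7).digest()

    def seal(self, secret: bytes) -> None:
        self._sealed = (self._policy(), secret)

    def unseal(self) -> bytes:
        expected, secret = self._sealed
        if self._policy() != expected:
            raise PermissionError("PCR 7 does not match the sealed policy")
        return secret

# Step 1: the secure boot state is measured into PCR 7 (secure boot off at enrollment).
tpm = ToyTPM(pcr7=hashlib.sha256(b"secure-boot: off").digest())
# Step 2: a new disk encryption key is generated.
disk_key = secrets.token_bytes(32)
# Step 3: the key is sealed against the current PCR 7 value.
tpm.seal(disk_key)
# Step 4: with an unchanged state, unsealing succeeds.
assert tpm.unseal() == disk_key

# Secure boot is enabled afterwards, so PCR 7 holds a different value.
tpm.pcr7 = hashlib.sha256(b"secure-boot: on").digest()
try:
    tpm.unseal()
    released = True
except PermissionError:
    released = False
print("key released after change:", released)
```

In this model the final unseal is refused, which is the behaviour the real system was supposed to show and did not.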

Clearly at least one of these steps wasn’t happening.

Understanding the Bug

My first thought was that the patch was probably bad. It was from a relatively unknown developer, and it was possible they just made a mistake. But looking at it, there wasn’t much going on. It basically did two things: patch some functions used while building the initramfs and add the necessary shared libraries to the initramfs. The changes were small enough that I read every line and was confident that nothing was wrong.

After that I just kept going down the list of things that might be wrong. I tried

  • Measuring against additional PCRs - still gave the key when it shouldn’t
  • Inspecting PCR values using tpm2-tools - they were visibly different between boot attempts
  • Switching TPM chips - system no longer decrypted, as expected
  • Switching distros - Fedora worked correctly in all cases, so it was related to something with Ubuntu

I spent quite a while trying to figure out how to add tpm2-tools to the initramfs in order to inspect the PCR values at the time they would actually be used. I thought that maybe the PCRs couldn’t properly be accessed during the early stages of the boot process when the main disk is decrypted. But it ended up showing that the PCR values properly measured the secure boot state before they were checked against. So as far as I could tell, the TPM was working correctly, which meant the issue had to be with what systemd was doing.

After talking with my friend Max Dulin, who has some experience with TPMs, I decided to run strace on the original systemd-cryptenroll command. My goal was to determine exactly what commands were being sent to the TPM during enrollment. Unfortunately, the output from strace is raw data:

write(6, "\x80\x01\x00\x00\x00\x0c\x00\x00\x01\x7b\x00\x08", 12) = 12
read(6, "\x80\x01\x00\x00\x00\x14\x00\x00\x00\x00", 10) = 10
read(6, "\x00\x08\x36\x3c\x0a\xc7\xb4\xa3\x93\x57", 10) = 10
write(6, "\x80\x01\x00\x00\x00\x0c\x00\x00\x01\x44\x00\x00", 12) = 12
read(6, "\x80\x01\x00\x00\x00\x0a\x00\x00\x01\x00", 10) = 10
write(6, "\x80\x02\x00\x00\x00\x43\x00\x00\x01\x31\x40\x00\x00\x01\x00...", 67) = 67
read(6, "\x80\x02\x00\x00\x01\x1a\x00\x00\x00\x00", 10) = 10
read(6, "\x80\xff\xff\xff\x00\x00\x01\x03\x00\x5a\x00\x23\x00\x0b\x00...", 272) = 272
write(6, "\x80\x01\x00\x00\x00\x3f\x00\x00\x01\x76\x40\x00\x00\x07\x40...", 63) = 63
read(6, "\x80\x01\x00\x00\x00\x30\x00\x00\x00\x00", 10) = 10
...
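
As a rough illustration of what decoding involves, the sketch below parses the 10-byte header that starts every TPM 2.0 command: a 2-byte tag, a 4-byte total size, and a 4-byte command code, all big-endian. The command-code table is only a small excerpt from the TPM 2.0 specification, and the function name is made up for this example.

```python
import struct

# A few command codes from the TPM 2.0 specification (not the full table).
COMMAND_CODES = {
    0x00000131: "TPM2_CreatePrimary",
    0x00000144: "TPM2_Startup",
    0x00000176: "TPM2_StartAuthSession",
    0x0000017B: "TPM2_GetRandom",
    0x0000017F: "TPM2_PolicyPCR",
}

TAGS = {0x8001: "TPM_ST_NO_SESSIONS", 0x8002: "TPM_ST_SESSIONS"}

def decode_command_header(raw: bytes):
    """Split a TPM command into (tag, size, command name, parameter bytes)."""
    tag, size, code = struct.unpack(">HII", raw[:10])
    name = COMMAND_CODES.get(code, hex(code))
    return TAGS.get(tag, hex(tag)), size, name, raw[10:]

# The first write() from the strace output above.
first_write = b"\x80\x01\x00\x00\x00\x0c\x00\x00\x01\x7b\x00\x08"
tag, size, name, params = decode_command_header(first_write)
# GetRandom takes a single 2-byte parameter: the number of bytes requested.
(requested,) = struct.unpack(">H", params)
print(tag, size, name, requested)
# prints: TPM_ST_NO_SESSIONS 12 TPM2_GetRandom 8
```

This matches the reply in the trace: after the 10-byte response header, the next read starts with a 2-byte length of 8 followed by 8 bytes of data. The longer commands need the per-command parameter layouts from the specification, which is where the real work is.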

Decoding this was very fun.† I ended up doing it by hand. When I finally finished, I noticed something strange. The important command was this:

PolicyPcr (sessionHandle: 03000000) (digest: 0) (pcrs: 1 [(TPM_ALG_SHA256:7)])

The important part of this line is TPM_ALG_SHA256, which is saying that the policy uses the SHA-256 PCRs. But that’s definitely wrong: whenever I inspected the PCR values, they were all SHA-1, not SHA-256. In fact, I don’t even think the TPM in my server supports SHA-256 PCRs at all. It seemed quite likely that this was the root cause of the issue.

To confirm it, I reinstalled Fedora and ran strace on the systemd-cryptenroll command. But I didn’t even need to decode the TPM requests to know why Fedora was working where Ubuntu was not. In bold text under the command there was a message that I had ignored the last time I tried Fedora, but now seemed blindingly obvious.

/root# systemd-cryptenroll --tpm2-device=auto /dev/sda3
TPM2 device lacks support for SHA256 PCR bank, but SHA1 bank is supported
and SHA1 PCRs are valid, falling back to SHA1 bank. This reduces the
security level substantially.

Judging by the lack of this error message on Ubuntu, it seemed like older versions of systemd must have assumed that every TPM supported SHA-256 PCRs. And the TPM, when receiving a policy that referenced a PCR that didn’t exist, would interpret the policy as always being valid. So the TPM would release the disk key regardless of any changes made to the server. While I thought the TPM was measuring PCR 7, it was actually measuring nothing.
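
A toy model of this failure mode, in Python. It assumes that a policy digest is computed over whichever selected PCRs actually exist, so a selection in a missing bank contributes nothing; that is one way the behaviour described above could arise, not a confirmed description of the TPM firmware. The function names and the data layout are invented for the example.

```python
import hashlib

def pcr_policy_digest(banks, bank, indices):
    """Digest over the selected PCR values; PCRs that don't exist contribute nothing."""
    selected = banks.get(bank, {})
    values = b"".join(selected[i] for i in indices if i in selected)
    return hashlib.sha256(bank.encode() + values).digest()

def boot_state(secure_boot: str):
    # This toy TPM only has a SHA-1 bank, like the one in the server above.
    return {"sha1": {7: hashlib.sha1(secure_boot.encode()).digest()}}

enrolled = boot_state("off")  # state when the key was sealed
changed = boot_state("on")    # state after enabling secure boot

# Does the policy digest change when the system state changes?
sha1_detects = pcr_policy_digest(enrolled, "sha1", [7]) != pcr_policy_digest(changed, "sha1", [7])
sha256_detects = pcr_policy_digest(enrolled, "sha256", [7]) != pcr_policy_digest(changed, "sha256", [7])
print(sha1_detects, sha256_detects)
# prints: True False
```

With the SHA-1 bank the change is detected. With the nonexistent SHA-256 bank the digest is the same constant for every system state, so a policy built on it matches no matter what has been modified.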

I was able to verify that this was indeed the problem by looking at the systemd GitHub repository. I found the blame for the error message I was only getting on Fedora. It came from a patch for a two-year-old bug report about the same issue I was seeing. The patch was released with systemd version 250, but Ubuntu 22.04 ships with systemd v249. It’s definitely not a very widespread issue, since you need to run a specific version of Ubuntu on a server with a TPM that doesn’t support SHA-256 and use systemd-cryptenroll to set up your disk encryption. But still, having any failure mode that involves always decrypting the data on your hard drive seems like a massive issue.
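
A quick way to check whether a given machine predates the fix is to look at the systemd major version. The sketch below parses the first line printed by `systemd --version`; the threshold of 250 comes from the patch release mentioned above, the helper names are made up, and the sample string is illustrative rather than copied from a real system.

```python
import re

FIXED_IN = 250  # first release with the SHA-1 fallback, per the patch discussed above

def systemd_major_version(version_output: str) -> int:
    """Extract the major version from the first line of `systemd --version`."""
    match = re.match(r"systemd (\d+)", version_output)
    if match is None:
        raise ValueError("unrecognised version output")
    return int(match.group(1))

def has_bank_fallback(version_output: str) -> bool:
    """True if this systemd is new enough to include the SHA-1 fallback."""
    return systemd_major_version(version_output) >= FIXED_IN

# Illustrative sample of what an affected system might print.
sample = "systemd 249 (249.11-0ubuntu3)\n+PAM +AUDIT ..."
print(has_bank_fallback(sample))
# prints: False
```

A version check like this only says whether the fallback code is present. It says nothing about which PCR banks the TPM itself has active, which still needs to be inspected separately.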

[†] You can enjoy the process too: all you need is the 1000-page TPM 2.0 Specification.

Takeaways

Ultimately, I didn’t actually find anything new. I was unknowingly just retracing the steps of some other developer years before me. But I learned a lot, so I can’t be too disappointed. Along with a bunch of esoterica about the Linux boot process and disk encryption techniques, this really drove home a few lessons which I had heard before but never really internalized.

Cryptographic systems should fail early and fail loudly.

This whole mess could have been avoided if the TPM just threw an error when trying to create an authorization policy for PCRs that were not active.

Even if it had broken the initial release of systemd-cryptenroll, it would have done so in a way that was easy for the developers to identify and fix. It would have been easy for the TPM to respond with “Error: PCR algorithm not supported”. But instead, the TPM silently accepted this obviously nonsensical policy (use PCRs that don’t exist) and then also silently failed open on every subsequent usage of that policy.
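
As a sketch of what failing early could look like, the Python function below refuses to build a PCR policy that refers to a bank or register the device does not have. The function, the exception type, and the `active_banks` structure are hypothetical; this is not an API from systemd or any TPM library.

```python
class UnsupportedBankError(ValueError):
    """Raised when a policy refers to PCRs that are not active on the device."""

def build_pcr_policy(bank: str, indices, active_banks):
    """Build a policy description, rejecting PCRs the device doesn't have."""
    if bank not in active_banks:
        raise UnsupportedBankError(f"PCR bank {bank!r} is not active on this device")
    missing = [i for i in indices if i not in active_banks[bank]]
    if missing:
        raise UnsupportedBankError(f"PCRs {missing} are not present in bank {bank!r}")
    return {"bank": bank, "pcrs": sorted(indices)}

# A device with only a SHA-1 bank of 24 PCRs.
active = {"sha1": set(range(24))}

policy = build_pcr_policy("sha1", [7], active)
try:
    build_pcr_policy("sha256", [7], active)
    error = None
except UnsupportedBankError as exc:
    error = str(exc)
print(policy)
print(error)
```

The point of the design is that the mistake surfaces at enrollment time, with a message that names the problem, instead of producing a policy that quietly protects nothing.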

I guess this is why cryptographic library designers have worked so hard to remove developer choice from the tools they offer. A software dev generally shouldn’t have to decide which padding mode or hashing algorithm they want; all they need is a magic box that takes a magic key and does cryptography on some input data.

Test to ensure things fail correctly.

I wouldn’t have caught this issue at all if I hadn’t explicitly tested the case where the system was supposed to not boot. It’s easy to assume a setup like this is working once it boots up without prompting for a password. But the goal isn’t just to decrypt automatically, it’s to decrypt automatically only when the state is valid. This is definitely important for unit tests and test vectors too.

Looking forward, I definitely want to spend more time working on this topic. The setup in this article can still easily be bypassed since the initramfs isn’t encrypted, so an attacker can change it at will. There are a few other areas to look at too, like setting up custom secure boot keys and allowing easier update/recovery.

(That’s it! First post down, hopefully many more to come in the future)