In January 2018, Microsoft released an advisory and security updates related to a newly discovered class of hardware vulnerabilities involving speculative execution side channels (known as Spectre and Meltdown) that affect AMD, ARM, and Intel CPUs to varying degrees. If you haven’t had a chance to learn about these issues, we recommend watching The Case of Spectre and Meltdown by the team at TU Graz from BlueHat Israel, reading the blog post by Jann Horn (@tehjh) of Google Project Zero.
We have also had multiple posts detailing the internals of our implementation to handle these side-channel attacks.
For today’s post, we have kernel developers Andrea Allievi and Chris Kleynhans describing our design and implementation of retpoline for Windows which improves performance of Spectre variant 2 mitigations (CVE-2017-5715) to noise-level for most scenarios. These improvements are available today in Windows Insider Builds (builds 18272 or newer, x64-only).
At a high level, the Spectre variant 2 attack exploits indirect branches to steal secrets located in higher privilege contexts (e.g. kernel-mode vs user-mode). Indirect branches are instructions where the target of the branch is not contained in the instruction itself, such as when the destination address is stored in a CPU register.
Describing the full Spectre attack is outside the scope of this article. Details are in the links above or in this whitepaper from Intel.
Our original mitigations for Spectre variant 2 made use of new capabilities exposed by CPU microcode updates to restrict indirect branch speculation when executing within kernel mode (IBRS and IBPB). While this was an effective mitigation from a security standpoint, it resulted in a larger performance degradation than we’d like on certain processors and workloads.
For this reason, starting in early 2018, we investigated alternatives and found promise in an approach developed by Google called retpoline. A full description of retpoline can be found here, but in short, retpoline works by replacing all indirect call or jumps in kernel-mode binaries with an indirect branch sequence that has safe speculation behavior.
This sequence, shown below in Figure 1, effects a safe control transfer to the target address by performing a function call, modifying the return address and then returning.
RP0: call RP2 ; push address of RP1 onto the stack and jump to RP2 RP1: int 3 ; breakpoint to capture speculation RP2: mov [rsp], <Jump Target> ; overwrite return address on the stack to desired target RP3: ret ; return
While this construct is not as fast as a regular indirect call or jump, it has the side effect of preventing the processor from unsafe speculative execution. This proves to be much faster than running all of kernel mode code with branch speculation restricted (IBRS set to 1). However, this construct is only safe to use on processors where the RET instruction does not speculate based on the contents of the indirect branch predictor. Those processors are all AMD processors as well as Intel processors codenamed Broadwell and earlier according to Intel’s whitepaper. Retpoline is not applicable to Skylake and later processors from Intel.
Traditionally the transformation of indirect calls and jumps into retpolines is performed when a binary is built by the compiler. However, there are several functional requirements in Windows that make a purely compile-time implementation insufficient.
These key requirements are:
To satisfy requirement 1 and 3, we decided that binaries would ship in a non-retpolined state and then be transformed into a retpolined state by rewriting the code sequences for all indirect calls. This ensures that systems that do not use retpoline can use the binaries as compiled without needing any support for retpoline and with minimal runtime cost.
However, performing the transformation at runtime does lead to one problem. How do we know what transformations need to be applied? Disassembling and analyzing driver machine code to locate all indirect calls is not practical.
To solve this problem, we collaborated with the compiler team in Visual Studio to develop a system whereby the compiler can emit a new type of metadata into driver binaries describing each indirect call or jump in the system. This metadata takes the form of new relocation entries in the Dynamic Value Relocation Table (DVRT).
The DVRT was originally introduced back in the Windows 10 Creators Update to improve kernel address space layout randomization (KASLR). It allowed the memory manager’s page frame number (PFN) database and page table self-map to be assigned dynamic addresses at runtime. The DVRT is stored directly in the binary and contains a series of relocation entries for each symbol (i.e. address) that is to be relocated. The relocation entries are themselves arranged in a hierarchical fashion grouped first by symbol and then by containing page to allow for a compact description of all locations in the binary that reference a relocatable symbol.
At build time, the compiler keeps track of all references to these special symbols and fills out the DVRT. Then at runtime the kernel will parse the DVRT and update each symbol reference with the correct dynamically assigned address. Importantly, the kernel will skip over any DVRT entries it does not recognize (i.e. those with an unknown symbol) so adding new symbols to the DVRT does not break older versions of Windows.
These properties meant the DVRT was a perfect place to store our retpoline metadata, however the existing DVRT format needed to be extended to support retpoline.
Based on Windows requirements, we classified indirect calls/jumps into three distinct forms and each of these forms has its own type of retpoline relocation and corresponding runtime fixup.
Let’s talk a little about each of these types of calls.
Import calls/jumps are, as the name implies, used for calls/jumps made by a binary to functions that have been imported from another binary. When compiling with retpoline, the compiler ensures that all such calls conform to the following form:
48 FF 15 XX XX XX XX call qword ptr [_imp_<function>] 0F 1F 44 00 00 nop
The call or jmp instruction always directly references the import address table (IAT) and has 5 bytes of additional padding (to be used by the retpoline fixup).
Switchtable jumps are used for jumps made to other locations within the same function and are so-named because of their usage in implementing C/C++ switch statements. When compiling with retpoline support the compiler ensures that such calls are always made through a register and take the following form:
FF D0 jmp rax CC CC CC int 3
All other indirect calls/jumps fall into the generic type. To simplify the retpoline relocation format and the corresponding fixup logic, the compiler ensures that all such indirect calls/jumps provide their target address in the RAX register. The exact format of the call/jump instruction however differs depending on whether it is protected by control flow guard (CFG).
Now that we have a way to identify all the indirect calls/jumps in the binary, we need to apply the fixups.
The NT memory manager has long had infrastructure to apply fixups to binaries at runtime. This infrastructure was extended to understand retpoline relocations and their corresponding fixups.
But what exactly do these fixups look like? As mentioned earlier, the Windows implementation needs to support mixed environments in which some drivers are not compiled with retpoline support. This means that we cannot simply replace every indirect call with a retpoline sequence like the example shown in the introduction. We need to ensure that the kernel gets the opportunity to inspect the target of the call or jump so that it can apply appropriate mitigations if the target does not support retpoline.
For this reason, we transform every indirect call or jump into a direct call or jump to a kernel provided “retpoline stub function”. For example, an indirect call to an imported function that looks like this:
call qword ptr [_imp_ExAllocatePoolWithTag] ; Target address located at a REL32 offset nop ; Padding
Will be replaced at runtime with a direct call to the retpoline import stub:
mov r10, qword ptr [_imp_ExAllocatePoolWithTag] ; R10 = target address call _guard_retpoline_import_r10 ; Direct REL32 call to the stub function
There are several retpoline stub functions each of which is specialized to the type of call/jump it handles. However, each function generally performs the following steps:
The usage of these stub functions ensures that we can satisfy the requirement to support mixed environments, however they do introduce one additional problem. The x64 direct call/jump instruction can only encode a target address within 2 GB of the call-site (since the target is specified by a signed 16- or 32-bit offset). Since the retpoline stub functions are implemented in the NT kernel binary this would generally mean that drivers would have to be loaded within 2 GB of the kernel binary.
To work around this requirement, all retpoline stub functions are contained within a single section of the NT kernel binary and have been carefully written to take no dependencies on their position relative to the rest of the binary. This allows us to map the physical memory pages backing the retpoline stub functions immediately after every driver in the system, giving each driver its own “copy” of the retpoline stub functions that is guaranteed to be within 2 GB of every indirect call/jump.
Indirect calls due to imported functions are by far the most common form of indirect control transfers in kernel-mode. The import call targets are determined at driver load time by processing the import address table (IAT) and remain constant throughout the driver’s lifetime. This means that most of the work provided by the retpoline import stub is unnecessary because we know at driver load time exactly where each of these calls will end up going and we know whether the target binary supports retpoline or not. Hence, we can use a much faster calling sequence.
With import optimization, we use the retpoline fixup infrastructure to replace eligible import calls with direct calls to the imported function. This eliminates the overhead of the retpoline import call stub as well as the guaranteed branch prediction miss due to retpoline itself. To be eligible for import optimization, a call must meet the following requirements:
Here is an example of how the code generation for the call is modified.
Original code sequence
call [__imp_<Function>] ; Call to an imported function nop ; 5-byte nop
Import Optimized code sequence
mov r10, [__imp_<Function>] ; R10 = target address (normal transformation) call <Function> ; Direct REL32 call to target
Import optimization turned out to be a big performance win! Hence, even on processors where retpoline cannot be used due to alternate return instruction behavior, we still use import optimization.
Retpoline has significantly improved the performance of the Spectre variant 2 mitigations on Windows. When all relevant kernel-mode binaries are compiled with retpoline, we’ve measured ~25% speedup in Office app launch times and up to 1.5-2x improved throughput in the Diskspd (storage) and NTttcp (networking) benchmarks on Broadwell CPUs in our lab. It is enabled by default in the latest Windows Client Insider Fast builds (for builds 18272 and higher on machines exposing compatible speculation control capabilities) and is targeted to ship with 19H1.
To check if retpoline and import optimizations are enabled, you can use the PowerShell cmdlet Get-SpeculationControlSettings. You can also use NtQuerySystemInformation to programmatically query retpoline status.
For a more in-depth look, here is a talk by Andrea Allievi at BlueHat 2018 talking about retpoline on Windows.
Give the latest builds a try and let us know your experience!
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.