Hi,
AKS takes more and more space in the Azure landscape, and there are a few best practices that you can follow to harden the environment and make it as secure as possible. As a preamble, remember that containers all share the kernel through system calls, so the level of isolation in the container world is not as strong as with virtual machines, and even more as with physical hosts. Mistakes can quickly lead to security issues.
This might sound obvious but one of the best ways to defend against malicious attacks, is to use bullet proof code. There is no way you'll be 100% bullet proof, but a few steps can be taken to maximize the robustness:
I've seen countless environments where the docker image itself is not hardened properly. I wrote a full blog post about this, so feel free to to read it https://techcommunity.microsoft.com/t5/azure-developer-community-blog/hardening-an-asp-net-container.... I took an ASP.NET code base as an example, but this is applicable to other languages. I'll summarize it here, in a nutshell:
Most of the times, we are using base images to build our own images, and most of the times, these base images have vulnerabilities. Use specialized software such as Snyk, Falco, Cloud Defender for Containers, etc. to identify them. Once identified, you should:
To push the shift left principle to the maximum, you can use Snyk's docker scan operation, right from the developer's machine to already identify vulnerabilities. Although Snyk is a paid product, you can scan a few images for free.
In the same post as before (https://techcommunity.microsoft.com/t5/azure-developer-community-blog/hardening-an-asp-net-container...), I also explain how to harden the K8s deployment itself. In a nutshell,
Although this might not be seen as a potential security issue, not specifying memory requests and limits can lead to an arbitrary eviction of other pods. Malicious users can take advantage of this to spread chaos within your cluster. So, you must always declare memory request and limits. You can optionally declare CPU requests/limits but this is not as important as memory.
K8s is a world where logical isolation takes precedence over physical isolation. So, whatever you do, you should make sure to adhere to the least privilege principle through proper RBAC configuration and proper network policies to control network traffic within the cluster, and potentially going outside (egress). Remember that by default, K8s is totally open, so every pod can talk to any other pod, whatever namespace it is located in. If you can't live with internal logical isolation only, you can also segregate workloads into different node pools and leverage Azure networking features such as NSGs to control network traffic at another level. I wrote an entire blog post on this: AKS, the elephant in the hub & spoke room, deep dive
Role-based access control can be configured for both humans and systems, thanks to Azure AD and K8s RBAC. There are multiple flavors available for AKS. Whichever one you use, you should make sure to:
Traffic can be controlled using plain K8s network policies or tools such as Calico. Network policies can be used to control pod-level ingress/egress traffic.
Because defense-in-depth relies on multiple ways to validate whether an ongoing operation is legal or not, you should also use a layer-7 protection, such as a Service Mesh or Dapr, which has some overlapping features with service meshes. The main difference between Dapr and a true Service Mesh is that applications using Dapr must be Dapr-aware while they don't need to know anything about a service mesh. The purpose of a layer-7 protection is to enable mTLS and fine-grained authorizations, in order to specify who can talk to who (on top of network policies). Most solutions today allow for fine-grained authorizations targeting operation-level scopes, when dealing with APIs. Dapr and Service Meshes come with many more juicy features that make you understand what a true Cloud native environment is.
Azure Policy is the corner stone of a tangible governance in Azure in general, and AKS makes no exception. With Azure Policy, you'll have a continuous assessment of your cluster's configuration as well as a way to control what can be deployed to the cluster. Azure Policy leverages Gatekeeper to deny non-compliant deployments. You can start smoothly in non-production by setting everything to Audit mode and switch to Deny in production. Azure Policy also allows you to whitelist known registries to make sure images cannot be pulled from everywhere.
Microsoft recently merged Defender for Registries and Defender for Kubernetes into Defender for Containers. There is a little bit of overlap with Azure Policy, but Defender also deploys DaemonSets that check for real-time threats. All incidents are categorized using the popular MITRE ATT&CK framework. One of the selling point is that Defender can handle any cluster, whether hosted on Azure or not. So, it is a multi-cloud solution. On top of assessing configuration and threats, Defender also ships with a built-in image scanning process leveraging Qualys behind the scenes. Images are scanned upon push operations as well as continuously to detect newer vulnerabilities that came after the push. Of course, there are other third party tools available such as Prisma Cloud, which you might be tempted to use, especially if you already run the Palo Alto NVAs.
This one is an easy one. Make sure to isolate the API server from internet. You can easily do that using Azure Private Link. If you can't do it for some reasons, try to at least restrict access to authorized IP address ranges.
Of course, an AKS cluster is by design inside an Azure Virtual Network. The cluster can expose some workloads outside through the use of an ingress controller, and anything can potentially go outside of the cluster, through an egress controller and/or an appliance controlling the network traffic.
Ingress can either be internet-facing callers or internal callers. A best practice is to isolate the AKS ingress controller (NGINX, Traefik, AGIC, etc.) from internet. You link it to an internal load balancer. Traffic that must be exposed to internet should be exposed through an Application Gateway, Front Door (using Private Link Service) or any other well-known non-Azure solution such as Barracuda, F5 etc. You should also distinguish pure UI traffic from API traffic. API traffic should also be filtered using an API gateway such as Azure APIM, Kong, Ambassador, etc. For "basic" scenarios, you might also offload JWT token validation to service meshes, but they will not have comparable features. You should for sure consider real API gateways for internet-facing APIs.
Pod-level egress traffic can be controlled by network policies or Calico, but also by most Service Meshes. Istio has even a dedicated egress controller, which can act as a proxy. On top of handling egress from within the cluster itself, it is a best practice to have a next-gen firewall waiting outside, such as Azure Firewall or third-party Network Virtual Appliances (NVA).
You start with one cluster, then 2, then a hundred. To keep some sort of consistency across cluster configurations, you can leverage Azure Policy. If your clusters are using on-premises or in another cloud, you can also use Azure Arc. Microsoft recently launched Azure Kubernetes Fleet Manager, which I haven't tried yet but is surely something to keep an eye on.
The above tips are by no means exhaustive but if you start with that, you should be in a better position when it comes to handling container security. There are a myriad of tools available on the market to better handle container security. Azure has some built-in capabilities and it is up to you to see if you prefer to use best of breed or best of suite. Note that more and more Azure native tools span beyond Azure itself, so your single pane of glasses could be Azure.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.