According to the CNCF 2019 survey, more than 84% of respondents are using containers in production, and more than 78% of respondents are using Kubernetes in production. We also know that 84% of all Kubernetes workloads in the public cloud run on AWS. Since the commercial launch of Amazon Elastic Kubernetes Service (EKS) in June 2018, there has been significant adoption of EKS for running Kubernetes workloads on AWS. And now, with AWS Fargate on Amazon EKS, launched at re:Invent 2019, there will be even greater adoption of EKS for all Kubernetes workloads.
However, best practices, and solutions for implementing those best practices, for running secure, reliable, efficient, and cost-effective Kubernetes workloads in production on EKS are not readily available in one place. Different elements of that content are spread across a large number of blog posts, white papers, and vendor publications, and most of it is not specific to EKS.
This blog is an attempt to provide this information in a concise, easy-to-use format. To ensure completeness of coverage, and to achieve consistency with best practices for other AWS services and solutions, the information is organized using the AWS Well-Architected Framework (WAF).
Operational Excellence
The Operational Excellence pillar includes the ability to run and monitor
systems to deliver business value.
| # | Checklist Item | Best Practices | Solution Options (AWS) | Solution Options (Partners/Open Source) |
|---|---|---|---|---|
| 1 | Kubernetes Version Updates and Patches | · Implement a documented and operational update and upgrade program · Pre-upgrade checks: Kubernetes release notes, EKS platform version notes, control plane and API server compatibility, other plug-in versions · Test, update, and upgrade the control plane (EKS) and the worker nodes | · eksctl | |
| 2 | DevOps | · Build immutable images · Use a ConfigMap instead of storing configuration information in images (see the ConfigMap sketch after this table) | · AWS CDK | · CircleCI |
| 3 | Observability | · Enable a high-level view into your running clusters · Configure timely incident alerts when something goes wrong · Define and deploy processes and tools to act on incident alerts | · Auditing and logging: Amazon CloudWatch Logs for the EKS control plane, AWS CloudTrail for the EKS API · Tracing: AWS X-Ray · Monitoring, alerting, and observability: CloudWatch ServiceLens · Analytics: CloudWatch Container Insights, CloudWatch ServiceLens · Automation: AWS Lambda, Amazon EventBridge, Auto Scaling Groups (ASG) | · Logging: Datadog, EFK · Monitoring and alerting: Prometheus, Alertmanager, PagerDuty |
| 4 | Service Discovery | · Run CoreDNS on each worker node · To discover services running outside the cluster, use a Kubernetes Service object without a pod selector, or use the ExternalName service type (see the sketch after this table) | | |
| 5 | Kubernetes Namespaces | · Use namespaces for easier resource management · Define and enforce a namespace naming convention | | |
| 6 | Service Mesh | Service mesh benefits: · standardizes how your services communicate · end-to-end visibility · end-to-end security · high availability · monitoring and dynamically controlling communication between services · makes it easier to deploy new versions of your services | · AWS App Mesh | |
| 7 | Single or Multiple Clusters | · Use a separate cluster per environment (dev, staging, prod, etc.) · Start with a single production cluster · Explore multiple clusters to support specific requirements: security or compliance requirements that isolate certain workloads, extremely high variability in scaling and network load requirements between workloads, or customer geographic distribution requiring clusters in different regions | | |
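
As a minimal sketch of the ConfigMap practice in row 2 (DevOps), the manifest below keeps configuration out of the image and injects it at deploy time; the names (`app-config`, `demo-app`), keys, and the ECR image URI are hypothetical placeholders.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config            # hypothetical name
data:
  LOG_LEVEL: "info"
  app.properties: |
    feature.flag=true
---
apiVersion: v1
kind: Pod
metadata:
  name: demo-app              # hypothetical name
spec:
  containers:
    - name: app
      image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/demo-app:1.0  # immutable, tagged image
      envFrom:
        - configMapRef:
            name: app-config  # inject keys as environment variables
      volumeMounts:
        - name: config
          mountPath: /etc/app # or mount the same ConfigMap as files
  volumes:
    - name: config
      configMap:
        name: app-config
```

Because the image is immutable, changing configuration means updating the ConfigMap and rolling the pods, not rebuilding the image.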
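
For row 4 (Service Discovery), here is a hedged sketch of the two patterns for reaching services outside the cluster: an ExternalName Service that resolves to an external DNS name, and a selector-less Service backed by a manually managed Endpoints object. The hostnames and IP address are placeholders.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: orders-db             # in-cluster DNS name pods will use
spec:
  type: ExternalName
  externalName: orders-db.cluster-abc123.us-east-1.rds.amazonaws.com  # placeholder external endpoint
---
# Alternative: a Service without a pod selector plus a manually managed Endpoints object
apiVersion: v1
kind: Service
metadata:
  name: legacy-api
spec:
  ports:
    - port: 443
---
apiVersion: v1
kind: Endpoints
metadata:
  name: legacy-api            # must match the Service name
subsets:
  - addresses:
      - ip: 10.0.12.34        # placeholder IP of the external service
    ports:
      - port: 443
```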
Security
The Security pillar includes the ability to protect information, systems, and
assets while delivering business value through risk assessments and mitigation
strategies.
| # | Checklist Item | Best Practices | Solution Options (AWS) | Solution Options (Partners/Open Source) |
|---|---|---|---|---|
| 1 | Secrets Management | Mount Secrets as volumes, not environment variables (see the Secret sketch after this table) | | |
| 2 | Container Runtime Security | · Prevent containers from running as root (by default, all processes in a container run as the root user, uid 0) · Disallow privileged containers · Disallow adding new capabilities; ensure that application pods cannot add new capabilities at runtime · Disallow changes to kernel parameters · Disallow use of bind mounts (hostPath volumes) · Disallow access to the Docker socket bind mount · Disallow use of host network and ports, which allows potential snooping of network traffic across application pods · Use a read-only root filesystem in containers (see the securityContext sketch after this table) | · EKS on Fargate (for VM isolation at the pod level) | · Pod Security Policy (PSP) |
| 3 | Pod Communications Control | Enable Kubernetes network policies to prevent unauthorized access, improve security, and segregate namespaces (see the NetworkPolicy sketch after this table) | · VPC CNI + Calico | |
| 4 | Kubernetes RBAC | · Disable auto-mounting of the default ServiceAccount · Set RBAC policies with the least privileges necessary · Keep RBAC policies granular, not shared · Avoid using wildcards in Roles and ClusterRoles (see the RBAC/IRSA sketch after this table) | · Configure IAM user/group mapping to Kubernetes RBAC roles · Configure IAM roles for service accounts if a pod needs access to AWS resources | |
| 5 | Cluster Security Benchmark | The cluster passes the CIS Kubernetes Benchmark tests | | |
| 6 | DevSecOps | · Secure credentials for CI to push images and for the cluster to pull images · Automate the scanning of vulnerabilities in your container images at the CI stage of your pipeline · Allow deploying containers only from known registries | · Amazon ECR: use ECR PrivateLink endpoint policies for fine-grained IAM-based access control | · Use Open Policy Agent (OPA) · Partner solution for OPA: styra.com |
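
For row 1 (Secrets Management), a minimal sketch of mounting a Secret as a read-only volume rather than exposing it as environment variables; the names and value are placeholders, and real secret material should be created out of band.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: db-credentials        # hypothetical name
type: Opaque
stringData:
  password: change-me         # placeholder value only
---
apiVersion: v1
kind: Pod
metadata:
  name: api
spec:
  containers:
    - name: api
      image: nginx:1.17       # stand-in image
      volumeMounts:
        - name: creds
          mountPath: /etc/secrets
          readOnly: true      # secret is exposed as files, not env vars
  volumes:
    - name: creds
      secret:
        secretName: db-credentials
```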
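
Most of the runtime controls in row 2 can be expressed per pod through a securityContext; the sketch below is one hedged example, assuming a hypothetical application image that is built to run as a non-root user (cluster-wide enforcement would come from a Pod Security Policy, which is not shown).

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hardened-app          # hypothetical name
spec:
  hostNetwork: false          # do not share the node's network namespace
  securityContext:
    runAsNonRoot: true        # refuse to start processes as uid 0
    runAsUser: 1000
  containers:
    - name: app
      image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/app:1.0  # placeholder non-root image
      securityContext:
        privileged: false
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]       # no added Linux capabilities at runtime
```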
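
For row 3, a minimal NetworkPolicy sketch (enforced by Calico on top of the VPC CNI) that denies all ingress to a namespace and then allows traffic only from pods within that same namespace; the namespace name is a placeholder.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: payments         # placeholder namespace
spec:
  podSelector: {}             # applies to every pod in the namespace
  policyTypes:
    - Ingress                 # no ingress rules listed => all inbound traffic denied
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
  namespace: payments
spec:
  podSelector: {}
  ingress:
    - from:
        - podSelector: {}     # allow traffic only from pods in this namespace
```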
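
For row 4, a hedged sketch of a narrowly scoped Role (no wildcards) bound to a dedicated ServiceAccount, with token auto-mounting disabled and an IAM-roles-for-service-accounts (IRSA) annotation; the names, namespace, and IAM role ARN are placeholders.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: orders-reader         # hypothetical name
  namespace: orders
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/orders-reader  # placeholder IRSA role
automountServiceAccountToken: false   # pods opt in explicitly instead of mounting a token by default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: read-configmaps
  namespace: orders
rules:
  - apiGroups: [""]
    resources: ["configmaps"]         # explicit resources, no wildcards
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-configmaps
  namespace: orders
subjects:
  - kind: ServiceAccount
    name: orders-reader
    namespace: orders
roleRef:
  kind: Role
  name: read-configmaps
  apiGroup: rbac.authorization.k8s.io
```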
Reliability
The Reliability pillar includes the ability of a system to recover from
infrastructure or service disruptions, and dynamically acquire computing
resources to meet demand.
| # | Checklist Item | Best Practices | Solution Options (AWS) | Solution Options (Partners/Open Source) |
|---|---|---|---|---|
| 1 | Disaster Recovery | · Practice "infrastructure as code" with fully automated CI/CD pipelines for easier cluster installs and upgrades · Use GitOps practices and tooling to recreate a cluster from Git | · Amazon EKS (multi-master and multi-AZ) · Amazon EBS and Amazon EFS as persistent volumes for stateful applications · Amazon S3, Amazon DynamoDB, and Amazon RDS for external data storage · Amazon ElastiCache for Redis for session data storage and in-memory caching | |
| 2 | High Availability | · Create worker nodes across multiple AZs · Deploy pods on multiple nodes: set anti-affinity rules (see the anti-affinity sketch after this table) · Deploy pods across multiple AZs | · Enable NLB (Network Load Balancer) in multiple AZs, and/or enable cross-zone load balancing · Configure Auto Scaling Groups (ASG) per AZ | |
| 3 | Scalability | · Use the Horizontal Pod Autoscaler (HPA) for apps with variable usage patterns (see the HPA sketch after this table) · Use the Cluster Autoscaler (CA) for varying workloads · For stateful applications using EBS-backed storage, configure multiple node groups, each scoped to a single AZ, and enable the --balance-similar-node-groups feature | · AWS Fargate for Amazon EKS (for fully managed autoscaling) | |
| 4 | Pod IP Address Inventory and ENI Management | · Size subnets appropriately to have sufficient addresses for pods · Select worker node instance sizes that support the expected number of pods, which may be limited by the number of ENIs that can be attached to an instance · The number of pods running on a cluster may also be limited by the number of VPC secondary CIDR addresses available | · Assign secondary CIDR ranges (non-RFC 1918 addresses) to the VPC, if needed | |
| 5 | Graceful Pod Shutdown | Implement a lifecycle hook in the pod spec so a pod doesn't shut down abruptly on SIGTERM but gracefully terminates its connections (see the preStop sketch after this table) | | |
| 6 | Health Checks | Set appropriate readiness probe and liveness probe values for containers (see the probes sketch after this table) | | |
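
For row 2 (High Availability), a minimal sketch of pod anti-affinity that spreads replicas across nodes, with a softer preference to spread across AZs via the zone topology key; the labels, names, and image are placeholders.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                   # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: web
              topologyKey: kubernetes.io/hostname                    # at most one replica per node
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: web
                topologyKey: failure-domain.beta.kubernetes.io/zone  # prefer spreading across AZs
      containers:
        - name: web
          image: nginx:1.17   # stand-in image
```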
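
For row 3 (Scalability), a minimal HorizontalPodAutoscaler sketch that targets average CPU utilization; the deployment name and thresholds are placeholders, and the Kubernetes Metrics Server is assumed to be installed in the cluster.

```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                 # placeholder deployment
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # scale out above 60% average CPU
```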
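
For row 5 (Graceful Pod Shutdown), a sketch of a preStop lifecycle hook plus termination grace period that gives the application time to drain connections before SIGTERM/SIGKILL; the sleep duration is an assumption chosen only to illustrate the pattern.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  terminationGracePeriodSeconds: 60   # total time allowed for shutdown
  containers:
    - name: web
      image: nginx:1.17               # stand-in image
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 15"]  # keep serving while the endpoint is deregistered
```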
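
And for row 6 (Health Checks), a sketch of readiness and liveness probes; the path, port, and timing values are placeholders to tune for your application.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api
spec:
  containers:
    - name: api
      image: nginx:1.17               # stand-in image
      readinessProbe:
        httpGet:
          path: /healthz              # placeholder endpoint
          port: 80
        initialDelaySeconds: 5
        periodSeconds: 10
      livenessProbe:
        httpGet:
          path: /healthz
          port: 80
        initialDelaySeconds: 15
        periodSeconds: 20
        failureThreshold: 3           # restart after three consecutive failures
```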
Performance Efficiency
The Performance Efficiency pillar includes the ability to use computing resources
efficiently to meet system requirements, and to maintain that efficiency as
demand changes and technologies evolve.
| # | Checklist Item | Best Practices | Solution Options (AWS) | Solution Options (Partners/Open Source) |
|---|---|---|---|---|
| 1 | Fine-Tuning Cluster Performance | · Use optimized base images · Cluster Autoscaler tuning: adjust the min/max size of a node group directly in the ASG | · Scaling Kubernetes deployments with Amazon CloudWatch metrics | |
| 2 | External Access and Traffic Routing | Use an ingress controller (see the Ingress sketch after this table) | · ALB Ingress Controller · Integrate the ALB Ingress Controller with AWS App Mesh for standardized east-west and north-south service communication | · NGINX · Traefik · Gloo |
| 3 | Resource Requests and Limits | · Set memory limits and requests for all containers · Set CPU limits after determining the correct settings for your container · Use a LimitRange object to define the standard size for containers deployed in a given namespace (see the LimitRange sketch after this table) · Use the Vertical Pod Autoscaler (VPA) in recommendation mode to get the right resource (CPU/memory) requests and limits | | |
| 4 | Windows Pods and Containers on Windows Worker Nodes | Windows worker nodes support one ENI per node, which limits the number of pods that can run on each node, so select the EC2 instance type based on your workload needs | · Use an Auto Scaling Group (ASG) for Windows worker nodes for scalability | |
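
For row 2 (External Access and Traffic Routing), a hedged Ingress sketch for the ALB Ingress Controller, assuming the controller is already deployed in the cluster; the host, backend Service name, and port are placeholders.

```yaml
apiVersion: extensions/v1beta1        # or networking.k8s.io/v1beta1 on newer clusters
kind: Ingress
metadata:
  name: web
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip   # route directly to pod IPs (required for Fargate pods)
spec:
  rules:
    - host: app.example.com           # placeholder hostname
      http:
        paths:
          - path: /*
            backend:
              serviceName: web        # placeholder backend Service
              servicePort: 80
```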
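
For row 3 (Resource Requests and Limits), a minimal sketch of explicit container requests and limits plus a LimitRange that supplies namespace defaults; the namespace and the numbers are placeholders to be derived from observed usage.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-limits
  namespace: orders                   # placeholder namespace
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: 250m
        memory: 256Mi                 # applied when a container omits requests
      default:
        cpu: 500m
        memory: 512Mi                 # applied when a container omits limits
---
apiVersion: v1
kind: Pod
metadata:
  name: worker
  namespace: orders
spec:
  containers:
    - name: worker
      image: nginx:1.17               # stand-in image
      resources:
        requests:
          cpu: 250m
          memory: 256Mi
        limits:
          cpu: "1"
          memory: 512Mi
```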
Cost Optimization
The Cost Optimization pillar includes the ability to run systems to deliver
business value at the lowest price point. A blog post that nicely covers this pillar for EKS was recently published, so I am only summarizing its findings here.
| # | Checklist Item | Best Practices | Solution Options (AWS) | Solution Options (Partners/Open Source) |
|---|---|---|---|---|
| 1 | Lowering Costs | · Identify actual CPU utilization by pods to set CPU request values · Use the Vertical Pod Autoscaler in recommendation mode (see the VPA sketch after this table) · Shut down or scale down the cluster at off-peak times | | |
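
As a sketch of the Vertical Pod Autoscaler recommendation-mode practice listed above, assuming the VPA CRDs and controllers are installed in the cluster; the target deployment name is a placeholder.

```yaml
apiVersion: autoscaling.k8s.io/v1beta2
kind: VerticalPodAutoscaler
metadata:
  name: web
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                         # placeholder deployment to analyze
  updatePolicy:
    updateMode: "Off"                 # recommendation mode: report suggestions, do not evict pods
```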
Conclusion
The AWS Well-Architected Framework provides architectural best practices across the five pillars for designing and operating reliable, secure, efficient, and cost-effective systems in the cloud. In this blog post, I have compiled the best practices for EKS, and paired them with implementation solutions, provided by AWS and/or by AWS partners or open source. Using the WAF in your architecture will help you produce stable and efficient systems, which allow you to focus on your functional requirements.