AKS Node Disk Pressure
What Happened?
Hey folks,
Let me share a real incident we faced in production on Azure Kubernetes Service (AKS). Our workloads were behaving oddly — pods getting evicted, app downtime alerts, and our monitoring tools screaming DiskPressure on some nodes.
We hadn’t made any infra changes recently, so the obvious question was:
What’s going on inside the AKS nodes?
Root Cause Analysis
We dug into the node metrics using Azure Monitor and kubectl describe node. Here’s what we found:
- **DiskPressure Condition = True**
- Evicted pods carried a status message like:
The node was low on resource: ephemeral-storage.
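If you want to spot the affected nodes quickly, a small filter like the one below can help. It is only a sketch: the helper name `disk_pressure_nodes` is made up for this post, and the kubectl line in the comment assumes you have access to the cluster.

```shell
# Print the names of nodes whose DiskPressure condition is True.
# Input on stdin: one "<node-name> <status>" pair per line.
disk_pressure_nodes() {
  awk '$2 == "True" { print $1 }'
}

# On a live cluster the input could come from kubectl, for example:
# kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}' | disk_pressure_nodes
```

Feeding it `aks-node-0 True` and `aks-node-1 False` prints only `aks-node-0`.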
It turned out that ephemeral storage on the nodes was filling up rapidly, mostly from:
- Container logs
- Image cache
- Temp files inside /var/lib/docker (/var/lib/containerd on containerd-based nodes)
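To see which of these is actually eating the disk, something along these lines can be run on the node itself. Treat it as a sketch: `dir_sizes_kib` is a hypothetical helper, and the directories in the comment are typical locations on a containerd-based node, not paths confirmed from our cluster.

```shell
# Print the size in KiB of each existing directory, largest first.
# Directories that do not exist are skipped silently.
dir_sizes_kib() {
  for d in "$@"; do
    if [ -d "$d" ]; then
      du -sk "$d" 2>/dev/null
    fi
  done | sort -rn
}

# Typical candidates (assumed paths; run as root on the node):
# dir_sizes_kib /var/log/pods /var/log/containers /var/lib/containerd /var/lib/kubelet
```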
Solution Applied
Here’s how we fixed it step-by-step:
1. Enabled Log Rotation
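We capped the size and number of container log files on our AKS nodes. On containerd-based nodes the kubelet performs this rotation, not /etc/containerd/config.toml, so on AKS the limits go into a custom node configuration file, here named kubeletconfig.json:
{
  "containerLogMaxSizeMB": 10,
  "containerLogMaxFiles": 3
}
The file is then passed in when creating the node pool (angle-bracket values are placeholders):
az aks nodepool add --resource-group <resource-group> --cluster-name <cluster-name> --name <nodepool-name> --kubelet-config ./kubeletconfig.json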
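To sanity-check limits like these against your disk size, it helps to multiply them out. The sketch below does that arithmetic; `log_budget_mib` is a made-up helper and the container count of 100 is an assumed figure, not a number from our cluster. The result is a rough ceiling, since a log file can briefly grow past its limit before the next rotation check.

```shell
# Rough upper bound, in MiB, on container log disk usage per node.
# Arguments: max size per log file (MiB), max files per container,
# number of containers running on the node.
log_budget_mib() {
  echo $(( $1 * $2 * $3 ))
}

# Example: 10 MiB x 3 files x 100 containers (assumed count)
log_budget_mib 10 3 100   # prints 3000
```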
2. Cleaned Up Unused Docker Resources (Manual Step)
On affected nodes:
docker system prune -a
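If your nodes run containerd (the default on current AKS versions), the Docker CLI is not available; use crictl instead, for example crictl rmi --prune to remove unused images.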
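Since the right command depends on the runtime, here is a small sketch that only prints the matching cleanup command instead of running it, so you can review it first. The function name `image_cleanup_command` is hypothetical; the two commands it prints are the standard crictl and Docker ways to prune unused images.

```shell
# Print (do not run) the image-cleanup command that matches the CLI found on this node.
image_cleanup_command() {
  if command -v crictl >/dev/null 2>&1; then
    echo "crictl rmi --prune"          # containerd nodes: remove unused images
  elif command -v docker >/dev/null 2>&1; then
    echo "docker image prune -a -f"    # Docker nodes: remove unused images without prompting
  else
    echo "no crictl or docker CLI found" >&2
    return 1
  fi
}

# Usage: review the suggestion, then run it yourself with sudo if it looks right.
# image_cleanup_command
```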
3. Implemented an Automated Cleanup DaemonSet
We created a DaemonSet that periodically cleans up unused logs and image layers on every node, so nobody has to intervene manually.
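The log-cleanup part of such a job can be as simple as deleting files past a certain age. The following is a minimal sketch, not our exact script: `clean_old_files`, the /host/var/log/myapp mount path, and the 7-day cutoff are all assumptions for illustration. Point it only at directories your own workloads write to.

```shell
# Delete regular files under directory $1 that were last modified more than
# $2 days ago, printing each path as it is removed.
# Does nothing if the directory is missing.
clean_old_files() {
  [ -d "$1" ] || return 0
  find "$1" -type f -mtime +"$2" -print -delete
}

# Inside the cleanup container this could run in a loop, for example:
# while true; do clean_old_files /host/var/log/myapp 7; sleep 3600; done
```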
Long-Term Fix:
- We increased our AKS node pool disk size from 100Gi to 200Gi.
- We set up proper alerts:
  - Azure Monitor alert when node disk usage exceeds 70%
  - Action Group integration to notify via Teams/Email
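For a quick manual check of the same 70% threshold on a single node, a sketch like this works from a shell on the node. It is not a replacement for the Azure Monitor alert, and the helper names `disk_used_pct` and `is_over_threshold` are made up for this post.

```shell
# Print the used percentage (integer, no % sign) of the filesystem holding path $1.
disk_used_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed if usage $1 is strictly greater than threshold $2.
is_over_threshold() {
  [ "$1" -gt "$2" ]
}

# Example: warn when the root filesystem is more than 70% full.
# if is_over_threshold "$(disk_used_pct /)" 70; then echo "disk usage above 70%"; fi
```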
Final Thoughts:
- Logs are silent killers.
- Monitoring node storage is non-negotiable.
- Don’t rely on default log settings — customize them!
If you’re running AKS in production, please double-check your disk usage, retention, and alerting setup.