Helm in Production: Hard-Learned Lessons and Common Gotchas
Running Helm charts in production reveals problems that development environments never surface. After auditing 150+ production Helm deployments, patterns emerge — the same mistakes crash systems, waste resources, and create 3 AM incidents.
The Resource Limits Crisis Nobody Talks About
According to The Real State of Helm Chart Reliability (2025), 63% of production Helm charts ship without CPU limits. This creates a ticking time bomb: when one container hits heavy load, it consumes all available CPU on the node. Other pods — including your critical services — get CPU-starved and stop responding.
Real numbers: The same study found 60% of charts lack memory limits. Without these guardrails, a single memory leak in one container triggers OutOfMemory conditions that kill every pod on the node. Not just the leaking container — everything.
Put simply: most Helm charts in production right now are one spike away from taking down entire nodes.
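As a minimal sketch of the missing guardrail (the numbers are placeholders, not recommendations — tune them from real monitoring data):

```yaml
# values.yaml — hypothetical chart; limits cap what one container can consume
resources:
  limits:
    cpu: 500m       # container is throttled above this, instead of starving neighbors
    memory: 512Mi   # container is OOM-killed above this, instead of taking down the node
```

In a typical chart, this block is wired into the deployment template with `{{ toYaml .Values.resources | nindent 12 }}` so operators can override it per environment.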
Why Your Helm Rollbacks Keep Failing
Rolling back a failed deployment should be simple: run helm rollback and everything returns to the previous state. Except it doesn't.
As noted in Troubleshooting Helm Deployments, rollbacks frequently fail to restore the correct state. The three-layer dependency between templates, values, and Kubernetes infrastructure creates misalignments. When rollback runs, ConfigMaps might not recreate properly. Stateful resources like databases remain in their failed state. Pods that rely on startup routines never reinitialize.
Honest take: Helm rollback is not a time machine. It only reverts the Kubernetes manifests — not your data, not your application state, not your external dependencies.
The Template Complexity Trap
One experienced Helm user shared their audit findings: "I've used Helm for 6+ years... Audited 150+ for clients. But most Helm charts are bloated, fragile, and impossible to scale."
The killer detail: "1 toggle breaks 3 other things." Teams add conditional logic to handle every possible configuration. Disable one feature in values.yaml, and cascading failures ripple through the chart. Another common sight: "20 templates in 1 chart… and no one knows what's actually deployed."
What this means for your project: Every conditional, every nested if-statement, every clever template trick increases the chance that your next deployment fails in ways you cannot predict.
Missing the Basics: Resource Requests
Here's what catches teams off guard: 51% of charts don't declare CPU requests. Without CPU requests, the Kubernetes scheduler has no idea how much CPU your pod needs. It places pods without accounting for their actual demand, leading to:
- Critical services landing on overloaded nodes
- Non-critical services consuming resources meant for production workloads
- Unpredictable performance as pods compete for CPU
The irony: adding CPU requests takes one line in your values.yaml. Half of production deployments skip this basic step.
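The fix really is that small. A sketch, with a placeholder value:

```yaml
# values.yaml — even a modest request gives the scheduler real data to place the pod
resources:
  requests:
    cpu: 100m
```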
Advanced Features: The 70% Gap
Production reliability features remain largely unused. According to the Helm reliability study:
- Topology spread constraints: missing in 70%+ of charts
- PodDisruptionBudgets: absent in over 70%
- Autoscaling configurations: not implemented in 70%+
These aren't nice-to-have features. Topology spread prevents all replicas from landing on the same node. PodDisruptionBudgets stop Kubernetes from terminating all your pods during node maintenance. Autoscaling handles traffic spikes without manual intervention.
Key takeaway for business: Your Helm charts probably lack the features that prevent downtime during normal Kubernetes operations.
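The first two are short manifests. A sketch with hypothetical names (`my-service` and its labels would come from your chart):

```yaml
# Keep at least one pod running during voluntary disruptions (node drains, upgrades)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-service-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-service
---
# Fragment of the Deployment's pod spec: spread replicas across nodes
spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: my-service
```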
Debugging Production Helm Failures
When production breaks, generic kubectl commands waste precious time. Troubleshooting Helm Deployments recommends this specific workflow:
- Run helm template first — catches rendering issues before they hit the cluster
- Use kubectl describe and kubectl logs for runtime errors
- Check helm get values and helm diff to debug configuration mismatches
The critical insight: "Helm relies on templates, values, and the underlying Kubernetes infrastructure. When any of these layers misalign, the deployment can fail or misbehave."
Permission Errors That Block Deployments
OneUptime's troubleshooting guide highlights a common production blocker:
```
Error: release my-release failed: deployments.apps is forbidden:
User "system:serviceaccount:default:default" cannot create resource "deployments"
```
Non-admin service accounts hit this wall. The fix requires proper RBAC configuration — something development environments with admin access never test.
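A sketch of that RBAC fix, assuming the service account only needs to manage Deployments in its own namespace (names are hypothetical; grant only the verbs your release actually requires):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: helm-deployer
  namespace: default
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["create", "get", "list", "watch", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: helm-deployer-binding
  namespace: default
subjects:
  - kind: ServiceAccount
    name: default
    namespace: default
roleRef:
  kind: Role
  name: helm-deployer
  apiGroup: rbac.authorization.k8s.io
```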
The One Thing That Works: Health Checks
Amidst all these failures, one pattern succeeds: approximately 80% of charts implement readiness and liveness probes. Basic health monitoring has become standard practice.
This proves teams can adopt reliability features when they understand the value. The question becomes: why stop at health checks?
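For reference, the pattern the 80% have adopted looks like this in a container spec (endpoints, port, and timings are placeholders — match them to your application):

```yaml
livenessProbe:           # restart the container if it stops responding
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:          # stop routing traffic until the app is ready
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 3
```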
Here Is What We Recommend
Based on patterns from 150+ production deployments:
Immediate fixes (do today):
- Add CPU and memory limits to every container. Start with generous limits, tighten based on monitoring
- Set CPU requests to 50% of your limits as a baseline
- Implement PodDisruptionBudgets for any service with multiple replicas
Before your next deployment:
- Test rollback procedures in staging — including data restoration steps
- Run helm template --validate and helm lint in CI pipelines
- Document which template changes require which values.yaml updates
Long-term improvements:
- Split complex charts into smaller, focused charts
- Add topology spread constraints for multi-node clusters
- Configure autoscaling for variable workloads
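The autoscaling item can be as simple as one HorizontalPodAutoscaler per variable workload. A sketch with hypothetical names and thresholds:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when pods exceed 70% of requested CPU
```

Note that the HPA's utilization math depends on CPU requests being set — another reason the basics come first.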
Honest take: Most teams discover these issues during outages. Learning them from this article costs nothing. Implementing these fixes takes hours. Recovering from production failures takes days — and customer trust.
Frequently Asked Questions
How should you bump the parent chart version in umbrella charts — only when a child chart changes, only when an application changes, or both?
Bump the parent chart version when either child charts or the parent's own templates change. This ensures deployments can track which combination of components they're running. Skip version bumps for documentation-only changes.
When using chart versus application versioning, how do you effectively track chart template changes separately from application version updates?
Use chart version for template/configuration changes and appVersion for application code changes. Chart version 2.1.0 with appVersion 1.5.0 tells you exactly which deployment logic wraps which application version. Include both in your release notes.
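In Chart.yaml, the two fields sit side by side (chart name is hypothetical; the version numbers mirror the example above):

```yaml
apiVersion: v2
name: my-service       # hypothetical chart name
version: 2.1.0         # bump when templates or chart configuration change
appVersion: "1.5.0"    # bump when the application image changes
```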
What's the difference between helm lint and helm template --validate, and how do they complement each other?
helm lint checks chart structure and best practices — missing values, deprecated APIs, formatting issues. helm template --validate renders templates and verifies the output creates valid Kubernetes resources. Run lint first to catch chart issues, then template to verify rendered manifests.
This article is based on publicly available sources and may contain inaccuracies.


