Quick Take Link to heading
I implemented Prometheus metrics in my Kubernetes admission controller to provide critical insights into its health, performance, and error patterns. This observability setup uses optimized histogram buckets for webhook latencies and includes a comprehensive test suite to ensure metrics are reliable even in edge cases.
Introduction Link to heading
In the previous parts of this series, I built a Kubernetes admission controller and set up integration testing to ensure it works properly. While having a well-tested controller is essential, I quickly realized I needed visibility into how it behaves in production. This is where observability comes in.
Observability encompasses logging, metrics, and tracing. In this article, I’ll focus on how I implemented Prometheus metrics to provide insights into my controller’s behavior.
Why Metrics Matter for Admission Controllers Link to heading
Admission controllers occupy a critical position in the Kubernetes request flow. Every API request that creates or modifies a resource passes through the relevant admission controllers before the object is persisted. This means my controller could become a bottleneck, or even a single point of failure, if it isn't working properly.
For my admission controller, I needed to know:
- Health: Is the controller alive and ready to accept requests?
- Performance: How quickly is it processing requests?
- Success Rate: Are requests being processed successfully or failing?
- Error Patterns: What types of errors are occurring and how frequently?
- Usage Patterns: Which endpoints are being called and how often?
I chose Prometheus metrics because they could answer all these questions while providing both real-time monitoring and historical data for analysis.
Designing My Metrics System Link to heading
When I started adding metrics to the controller, I was tempted to measure everything. My first draft had over 20 different metrics! After a few days of reflection, I realized this approach would create unnecessary complexity and potentially burden Prometheus with excessive cardinality.
I discussed best practices with Claude and eventually settled on these key metrics:
- Request Counters: Track the number of requests by path, method, and status code
- Duration Histograms: Measure request latency with buckets optimized for webhook workloads
- Error Counters: Track errors by path, method, and status code
- Health Gauges: Monitor readiness and liveness status
This focused approach gives me the insights I need without overwhelming the metrics system.
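To make this concrete, here is roughly what those four families look like on the /metrics endpoint once everything below is wired up. The metric names match the PromQL examples later in this article; the label values and sample numbers are purely illustrative:
# HELP pod_label_webhook_requests_total Total number of requests processed
# TYPE pod_label_webhook_requests_total counter
pod_label_webhook_requests_total{method="POST",path="/mutate",status="200"} 1027
# TYPE pod_label_webhook_request_duration_seconds histogram
pod_label_webhook_request_duration_seconds_bucket{method="POST",path="/mutate",le="0.025"} 1012
# TYPE pod_label_webhook_errors_total counter
pod_label_webhook_errors_total{method="POST",path="/mutate",status="500"} 3
# TYPE pod_label_webhook_readiness_status gauge
pod_label_webhook_readiness_status 1
# TYPE pod_label_webhook_liveness_status gauge
pod_label_webhook_liveness_status 1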
Implementation Details Link to heading
Here’s how I implemented these metrics:
Metric Registration Link to heading
The first step was setting up a metrics structure and registration function:
// metrics holds our Prometheus metrics
type metrics struct {
    requestCounter  *prometheus.CounterVec
    requestDuration *prometheus.HistogramVec
    errorCounter    *prometheus.CounterVec
    readinessGauge  prometheus.Gauge
    livenessGauge   prometheus.Gauge
    registry        *prometheus.Registry
}

// initMetrics initializes Prometheus metrics with an optional registry
func initMetrics(reg prometheus.Registerer) (*metrics, error) {
    if reg == nil {
        reg = prometheus.DefaultRegisterer
    }

    m := &metrics{}

    // Request counter
    m.requestCounter = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Namespace: metricsNamespace,
            Name:      "requests_total",
            Help:      "Total number of requests processed",
        },
        []string{"path", "method", "status"},
    )
    if err := reg.Register(m.requestCounter); err != nil {
        return nil, fmt.Errorf("could not register request counter: %w", err)
    }

    // ... more metrics registration ...

    return m, nil
}
I decided to structure the code this way to allow for testing with a custom registry and to clearly separate metric creation from usage. This approach made unit testing much easier, as I could create isolated test registries.
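For example, a test can hand initMetrics its own registry while production code passes nil. A minimal sketch of that wiring (the setupMetrics helper is hypothetical, not part of the actual code):
// setupMetrics is a hypothetical helper showing both wiring modes.
func setupMetrics(isolated bool) (*metrics, error) {
    if isolated {
        // Tests get a fresh registry so registrations never collide
        // across test cases or with the global default registry.
        return initMetrics(prometheus.NewRegistry())
    }
    // Passing nil falls back to prometheus.DefaultRegisterer inside initMetrics.
    return initMetrics(nil)
}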
Optimized Latency Buckets Link to heading
One of my biggest challenges was choosing appropriate histogram buckets for the duration metrics. The default Prometheus buckets are designed for general-purpose monitoring and aren't granular enough for webhooks, which typically respond in milliseconds, not seconds.
var (
    // Buckets optimized for webhook latencies: 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s
    webhookDurationBuckets = []float64{0.005, 0.010, 0.025, 0.050, 0.100, 0.250, 0.500, 1.000, 2.500, 5.000}
)
This distribution gives me fine-grained visibility for typical latencies (5-100ms) while still capturing outliers.
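The registration code elided earlier applies these buckets when the duration histogram is created. A sketch of how that looks, following the same pattern as the request counter (the Help string here is assumed; the metric name and labels match what the middleware and PromQL examples use):
// Request duration histogram using the webhook-optimized buckets.
m.requestDuration = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Namespace: metricsNamespace,
        Name:      "request_duration_seconds",
        Help:      "Duration of HTTP requests in seconds",
        Buckets:   webhookDurationBuckets,
    },
    []string{"path", "method"},
)
if err := reg.Register(m.requestDuration); err != nil {
    return nil, fmt.Errorf("could not register request duration histogram: %w", err)
}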
Middleware for Request Metrics Link to heading
To capture metrics for all requests, I implemented a middleware that wraps HTTP handlers:
func (m *metrics) metricsMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        wrapped := newStatusRecorder(w)

        defer func() {
            if err := recover(); err != nil {
                // Log the panic
                log.Error().
                    Interface("panic", err).
                    Str("stack", string(debug.Stack())).
                    Msg("Handler panic recovered")

                // Set 500 status
                wrapped.WriteHeader(http.StatusInternalServerError)

                // Record error metrics
                m.requestCounter.WithLabelValues(r.URL.Path, r.Method, "500").Inc()
                m.errorCounter.WithLabelValues(r.URL.Path, r.Method, "500").Inc()
                m.requestDuration.WithLabelValues(r.URL.Path, r.Method).Observe(time.Since(start).Seconds())
            }
        }()

        next.ServeHTTP(wrapped, r)

        // Record metrics after successful handling
        m.requestCounter.WithLabelValues(r.URL.Path, r.Method, fmt.Sprintf("%d", wrapped.status)).Inc()
        m.requestDuration.WithLabelValues(r.URL.Path, r.Method).Observe(time.Since(start).Seconds())

        // Record errors (status >= 400)
        if wrapped.status >= 400 {
            m.errorCounter.WithLabelValues(r.URL.Path, r.Method, fmt.Sprintf("%d", wrapped.status)).Inc()
        }
    })
}
This middleware captures all the key metrics I need, including request counts, durations, and error counts. I spent extra time ensuring that metrics are recorded even in the case of panics – I learned this lesson the hard way when a previous service had issues but wasn’t recording metrics properly during failures.
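The wrapped writer above is a small statusRecorder type that captures the status code a handler writes. The webhook's actual implementation isn't reproduced here, but a minimal version of the pattern looks like this:
// statusRecorder wraps http.ResponseWriter so the middleware can read the
// status code after the handler runs. It defaults to 200, since handlers
// that never call WriteHeader implicitly return http.StatusOK.
type statusRecorder struct {
    http.ResponseWriter
    status int
}

func newStatusRecorder(w http.ResponseWriter) *statusRecorder {
    return &statusRecorder{ResponseWriter: w, status: http.StatusOK}
}

// WriteHeader records the status code before delegating to the wrapped writer.
func (r *statusRecorder) WriteHeader(code int) {
    r.status = code
    r.ResponseWriter.WriteHeader(code)
}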
Health Status Metrics Link to heading
For monitoring health, I implemented gauge metrics that reflect the readiness and liveness status:
func (m *metrics) updateHealthMetrics(ready, alive bool) {
    // Convert bool to float64 (1 for true, 0 for false)
    if ready {
        m.readinessGauge.Set(1)
    } else {
        m.readinessGauge.Set(0)
    }

    if alive {
        m.livenessGauge.Set(1)
    } else {
        m.livenessGauge.Set(0)
    }
}
These gauges provide immediate visibility into the controller’s health status, making it easy to set up alerts and dashboards that show when the service is unhealthy.
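The gauges only reflect reality if something calls updateHealthMetrics whenever the probes are evaluated. A sketch of that wiring, with the handler and check functions named hypothetically:
// registerHealthHandlers is a hypothetical sketch: it serves /readyz and
// keeps the health gauges in sync with what the probe reports.
func registerHealthHandlers(mux *http.ServeMux, m *metrics, isReady, isAlive func() bool) {
    mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
        ready, alive := isReady(), isAlive()
        m.updateHealthMetrics(ready, alive)
        if !ready {
            http.Error(w, "not ready", http.StatusServiceUnavailable)
            return
        }
        w.WriteHeader(http.StatusOK)
    })
}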
Testing Metrics Implementation Link to heading
One of my priorities was creating a comprehensive test suite for metrics. I’ve been burned before by unreliable metrics that looked fine in normal operation but failed to record important data in edge cases.
Unit Testing Metrics Link to heading
For basic functionality, I created unit tests that verify metric registration and updates:
func TestMetricsInitialization(t *testing.T) {
    tests := []struct {
        name     string
        registry *prometheus.Registry
        wantErr  bool
    }{
        {
            name:    "successful initialization with default registry",
            wantErr: false,
        },
        {
            name:     "initialization with custom registry",
            registry: prometheus.NewRegistry(),
            wantErr:  false,
        },
    }

    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            reg := tt.registry
            if reg == nil {
                reg = prometheus.NewRegistry()
            }

            m, err := initMetrics(reg)
            if tt.wantErr {
                assert.Error(t, err)
                assert.Nil(t, m)
                return
            }

            assert.NoError(t, err)
            assert.NotNil(t, m)

            // Verify metrics are registered
            metrics, err := reg.Gather()
            require.NoError(t, err)
            assert.NotEmpty(t, metrics, "Expected metrics to be registered")
        })
    }
}
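Beyond checking that registration succeeds, it's also worth asserting concrete metric values. A sketch of that kind of test, using httptest and the client library's testutil package (github.com/prometheus/client_golang/prometheus/testutil); the /healthz path here is just an example:
func TestMiddlewareCountsRequests(t *testing.T) {
    reg := prometheus.NewRegistry()
    m, err := initMetrics(reg)
    require.NoError(t, err)

    // Wrap a trivial handler in the metrics middleware.
    handler := m.metricsMiddleware(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK)
    }))

    req := httptest.NewRequest(http.MethodGet, "/healthz", nil)
    handler.ServeHTTP(httptest.NewRecorder(), req)

    // testutil.ToFloat64 reads the value of a single counter child.
    got := testutil.ToFloat64(m.requestCounter.WithLabelValues("/healthz", "GET", "200"))
    assert.Equal(t, float64(1), got)
}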
Testing Edge Cases Link to heading
I was particularly concerned about edge cases, so I created tests that verify metrics are recorded correctly in challenging scenarios:
func TestMetricsMiddlewareEdgeCases(t *testing.T) {
    // Helper function to create a large string of specified size in KB
    createLargeString := func(sizeKB int) string {
        chunk := strings.Repeat("x", 1024) // 1KB chunk
        return strings.Repeat(chunk, sizeKB)
    }

    tests := []struct {
        name           string
        path           string
        method         string
        statusCode     int
        expectedLabels map[string]string
        requestBody    string
        responseBody   string
        handler        http.HandlerFunc
        sleep          time.Duration
    }{
        {
            name:       "panicking handler",
            path:       "/panic",
            method:     "POST",
            statusCode: http.StatusInternalServerError,
            handler: func(w http.ResponseWriter, r *http.Request) {
                panic("intentional panic for testing")
            },
        },
        {
            name:       "slow handler",
            path:       "/slow",
            method:     "GET",
            statusCode: http.StatusOK,
            sleep:      2 * time.Second,
            handler: func(w http.ResponseWriter, r *http.Request) {
                time.Sleep(2 * time.Second)
                w.WriteHeader(http.StatusOK)
            },
        },
        // More test cases...
    }

    // Test implementation...
}
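The elided loop drives each case through the middleware; a simplified sketch of what that execution and its common assertions can look like (the per-case metric assertions vary):
for _, tt := range tests {
    t.Run(tt.name, func(t *testing.T) {
        reg := prometheus.NewRegistry()
        m, err := initMetrics(reg)
        require.NoError(t, err)

        handler := m.metricsMiddleware(tt.handler)
        req := httptest.NewRequest(tt.method, tt.path, strings.NewReader(tt.requestBody))
        rec := httptest.NewRecorder()

        // The middleware's recover() keeps a panicking handler from failing the test.
        handler.ServeHTTP(rec, req)
        assert.Equal(t, tt.statusCode, rec.Code)

        // Every case, including the panic, should still leave metrics behind.
        families, err := reg.Gather()
        require.NoError(t, err)
        assert.NotEmpty(t, families)
    })
}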
These tests helped me identify several subtle bugs in my initial implementation:
- I wasn’t properly recording the status code for panics
- Very slow requests weren’t being recorded in the correct histogram buckets
- There was a race condition in metrics recording for concurrent requests
Fixing these issues before deployment saved me from discovering them in production.
Exposing Metrics Securely Link to heading
Security was another key concern. Since I was already using TLS for my webhook endpoints, I decided to extend that to secure the metrics endpoint as well:
func (m *metrics) handler() http.Handler {
    if m.registry != nil {
        return promhttp.HandlerFor(m.registry, promhttp.HandlerOpts{})
    }
    return promhttp.Handler()
}
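The handler is then mounted on its own TLS listener alongside the webhook server. A sketch of that setup; the port and certificate paths are placeholders, not the repository's actual configuration:
mux := http.NewServeMux()
mux.Handle("/metrics", m.handler())

metricsServer := &http.Server{
    Addr:    ":8443", // placeholder port
    Handler: mux,
    TLSConfig: &tls.Config{
        MinVersion: tls.VersionTLS12,
    },
}

// Certificate paths are placeholders; in the cluster they come from a mounted secret.
if err := metricsServer.ListenAndServeTLS("/certs/tls.crt", "/certs/tls.key"); err != nil && err != http.ErrServerClosed {
    log.Fatal().Err(err).Msg("metrics server failed")
}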
I set up the metrics endpoint with appropriate TLS configuration and documented the required ServiceMonitor configuration for secure scraping:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: pod-label-webhook
  namespace: webhook-test
spec:
  selector:
    matchLabels:
      app: pod-label-webhook
  namespaceSelector:
    matchNames:
      - webhook-test
  endpoints:
    - port: metrics
      scheme: https
      tlsConfig:
        ca:
          secret:
            name: webhook-metrics-cert
            key: ca.crt
        cert:
          secret:
            name: webhook-metrics-cert
            key: tls.crt
        keySecret:
          name: webhook-metrics-cert
          key: tls.key
      interval: 30s
      scrapeTimeout: 10s
      path: /metrics
This configuration ensures that only authenticated Prometheus instances can scrape our metrics, preventing unauthorized access to potentially sensitive data.
Visualizing Metrics with Grafana Link to heading
To make the metrics immediately useful, I created a Grafana dashboard:
{
  "panels": [
    {
      "datasource": {
        "type": "prometheus",
        "uid": "prometheus"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "reqps"
        },
        "overrides": []
      },
      "options": {
        "orientation": "auto",
        "reduceOptions": {
          "calcs": [
            "lastNotNull"
          ],
          "fields": "",
          "values": false
        },
        "showThresholdLabels": false,
        "showThresholdMarkers": true
      },
      "pluginVersion": "10.0.0",
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "prometheus"
          },
          "expr": "sum(rate(pod_label_webhook_requests_total[5m]))",
          "refId": "A"
        }
      ],
      "title": "Request Rate",
      "type": "gauge"
    },
    // More panels...
  ]
}
The dashboard includes these key panels:
- Request Rate gauge
- Request Duration (P95) time series
- Readiness Status indicator
- Error Rate by Path time series
This dashboard provides a comprehensive view of the controller’s performance and health.
Example PromQL Queries Link to heading
To help my team use the metrics during troubleshooting, I documented some example PromQL queries:
Request Rate Link to heading
# Request rate over the last 5 minutes
rate(pod_label_webhook_requests_total[5m])
# Error rate over the last 5 minutes
rate(pod_label_webhook_errors_total[5m])
Latency Link to heading
# 95th percentile latency over the last hour
histogram_quantile(0.95, sum(rate(pod_label_webhook_request_duration_seconds_bucket[1h])) by (le))
# Average request duration
rate(pod_label_webhook_request_duration_seconds_sum[5m]) /
rate(pod_label_webhook_request_duration_seconds_count[5m])
Health Status Link to heading
# Current readiness status
pod_label_webhook_readiness_status
# Current liveness status
pod_label_webhook_liveness_status
I keep these queries handy in a team documentation page for quick reference during troubleshooting.
Setting Up Alerts Link to heading
Finally, I configured alerts for common failure scenarios:
groups:
  - name: pod-label-webhook
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(pod_label_webhook_errors_total[5m])) /
          sum(rate(pod_label_webhook_requests_total[5m])) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High error rate in pod-label-webhook
          description: Error rate is above 10% for the last 5 minutes
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(pod_label_webhook_request_duration_seconds_bucket[5m]))
            by (le)
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High latency in pod-label-webhook
          description: 95th percentile latency is above 1 second for the last 5 minutes
      - alert: WebhookNotReady
        expr: pod_label_webhook_readiness_status == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Pod label webhook is not ready
          description: Readiness probe has been failing for 5 minutes
These alerts have already saved me from several potential outages by providing early warning of issues before they became critical.
Lessons Learned Link to heading
Implementing this metrics system taught me several valuable lessons:
- Optimize Histogram Buckets: Default Prometheus histogram buckets are too coarse for webhook latencies. I found custom buckets provide much better visibility into the performance characteristics that actually matter for my service.
- Panic Recovery Matters: I learned the hard way in a previous project that ensuring metrics are recorded even when handlers panic is crucial for diagnosing issues. I made sure to include comprehensive panic recovery in my metrics middleware.
- Test Edge Cases: The most interesting metrics are often generated during abnormal conditions. Testing slow requests, large payloads, and error conditions revealed subtle bugs in my initial implementation that would have been difficult to catch in production.
- Label Cardinality: I initially wanted to include more labels for finer-grained metrics but realized this would cause cardinality explosion. Choosing the right labels (path, method, status) provides useful insights without overwhelming Prometheus.
- Security Integration: Integrating metrics with my existing TLS setup required extra work but ensured secure scraping without adding complexity to the deployment model.
Working with Claude on this implementation was particularly interesting. The AI excelled at generating the metrics infrastructure code and comprehensive tests, but I needed to guide the discussion around specific metrics choices and histogram bucket optimization based on my understanding of our production workloads.
Looking Forward Link to heading
With this metrics implementation in place, I now have deep visibility into my controller’s operation. In production, these metrics help me identify issues quickly, track performance over time, and ensure high availability.
In the next article, I’ll explore how I’ve been using AI as a code reviewer to identify potential issues, suggest optimizations, and improve the overall quality of my controller. This approach has been instrumental in refining the implementation beyond what was initially generated.
The complete metrics implementation can be found in my GitHub repository, along with the dashboard configuration and documentation.