Quick Take Link to heading

I implemented Prometheus metrics in my Kubernetes admission controller to provide critical insights into its health, performance, and error patterns. This observability setup uses optimized histogram buckets for webhook latencies and includes a comprehensive test suite to ensure metrics are reliable even in edge cases.

Introduction Link to heading

In the previous parts of this series, I built a Kubernetes admission controller and set up integration testing to ensure it works properly. While having a well-tested controller is essential, I quickly realized I needed visibility into how it behaves in production. This is where observability comes in.

Observability encompasses logging, metrics, and tracing. In this article, I’ll focus on how I implemented Prometheus metrics to provide insights into my controller’s behavior.

Why Metrics Matter for Admission Controllers Link to heading

Admission controllers occupy a critical position in the Kubernetes request flow. Every API request that could modify resources passes through the relevant admission controllers before being persisted. This means my controller could become a bottleneck or a single point of failure if it isn't working properly.

For my admission controller, I needed to know:

  1. Health: Is the controller alive and ready to accept requests?
  2. Performance: How quickly is it processing requests?
  3. Success Rate: Are requests being processed successfully or failing?
  4. Error Patterns: What types of errors are occurring and how frequently?
  5. Usage Patterns: Which endpoints are being called and how often?

I chose Prometheus metrics because they could answer all these questions while providing both real-time monitoring and historical data for analysis.

Designing My Metrics System Link to heading

When I started adding metrics to the controller, I was tempted to measure everything. My first draft had over 20 different metrics! After a few days of reflection, I realized this approach would create unnecessary complexity and potentially burden Prometheus with excessive cardinality.

I discussed best practices with Claude and eventually settled on these key metrics:

  • Request Counters: Track the number of requests by path, method, and status code
  • Duration Histograms: Measure request latency with buckets optimized for webhook workloads
  • Error Counters: Track errors by path, method, and status code
  • Health Gauges: Monitor readiness and liveness status

This focused approach gives me the insights I need without overwhelming the metrics system.

Implementation Details Link to heading

Here’s how I implemented these metrics:

Metric Registration Link to heading

The first step was setting up a metrics structure and registration function:

// metrics holds our Prometheus metrics
type metrics struct {
    requestCounter  *prometheus.CounterVec
    requestDuration *prometheus.HistogramVec
    errorCounter    *prometheus.CounterVec
    readinessGauge  prometheus.Gauge
    livenessGauge   prometheus.Gauge
    registry        *prometheus.Registry
}

// initMetrics initializes Prometheus metrics with an optional registry
func initMetrics(reg prometheus.Registerer) (*metrics, error) {
    if reg == nil {
        reg = prometheus.DefaultRegisterer
    }

    m := &metrics{}

    // Request counter
    m.requestCounter = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Namespace: metricsNamespace,
            Name:      "requests_total",
            Help:      "Total number of requests processed",
        },
        []string{"path", "method", "status"},
    )
    if err := reg.Register(m.requestCounter); err != nil {
        return nil, fmt.Errorf("could not register request counter: %w", err)
    }

    // ... more metrics registration ...

    return m, nil
}

I decided to structure the code this way to allow for testing with a custom registry and to clearly separate metric creation from usage. This approach made unit testing much easier, as I could create isolated test registries.

Optimized Latency Buckets Link to heading

One of my biggest challenges was choosing appropriate histogram buckets for the duration metrics. The default Prometheus buckets are designed for general-purpose monitoring, but they're not granular enough for webhooks, which typically respond within milliseconds.

var (
    // Buckets optimized for webhook latencies: 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s
    webhookDurationBuckets = []float64{0.005, 0.010, 0.025, 0.050, 0.100, 0.250, 0.500, 1.000, 2.500, 5.000}
)

This distribution gives me fine-grained visibility for typical latencies (5-100ms) while still capturing outliers.
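
These buckets are passed to the duration histogram when it is registered. That part of the registration was elided above, so here is a sketch of how it might look, inferred from the bucket variable and the metric name that appears in the PromQL queries later in this article:

// Request duration histogram using the webhook-optimized buckets.
// Labels mirror how the middleware observes durations (path, method).
m.requestDuration = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Namespace: metricsNamespace,
        Name:      "request_duration_seconds",
        Help:      "Duration of HTTP requests in seconds",
        Buckets:   webhookDurationBuckets,
    },
    []string{"path", "method"},
)
if err := reg.Register(m.requestDuration); err != nil {
    return nil, fmt.Errorf("could not register request duration histogram: %w", err)
}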

Middleware for Request Metrics Link to heading

To capture metrics for all requests, I implemented a middleware that wraps HTTP handlers:

func (m *metrics) metricsMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        wrapped := newStatusRecorder(w)

        defer func() {
            if err := recover(); err != nil {
                // Log the panic
                log.Error().
                    Interface("panic", err).
                    Str("stack", string(debug.Stack())).
                    Msg("Handler panic recovered")

                // Set 500 status
                wrapped.WriteHeader(http.StatusInternalServerError)

                // Record error metrics
                m.requestCounter.WithLabelValues(r.URL.Path, r.Method, "500").Inc()
                m.errorCounter.WithLabelValues(r.URL.Path, r.Method, "500").Inc()
                m.requestDuration.WithLabelValues(r.URL.Path, r.Method).Observe(time.Since(start).Seconds())
            }
        }()

        next.ServeHTTP(wrapped, r)

        // Record metrics once the handler returns normally (the panic path above records its own)
        m.requestCounter.WithLabelValues(r.URL.Path, r.Method, fmt.Sprintf("%d", wrapped.status)).Inc()
        m.requestDuration.WithLabelValues(r.URL.Path, r.Method).Observe(time.Since(start).Seconds())

        // Record errors (status >= 400)
        if wrapped.status >= 400 {
            m.errorCounter.WithLabelValues(r.URL.Path, r.Method, fmt.Sprintf("%d", wrapped.status)).Inc()
        }
    })
}

This middleware captures all the key metrics I need, including request counts, durations, and error counts. I spent extra time ensuring that metrics are recorded even in the case of panics – I learned this lesson the hard way when a previous service had issues but wasn’t recording metrics properly during failures.
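
The newStatusRecorder helper used above wraps the ResponseWriter so the middleware can read the status code once the handler has run. A minimal sketch of that wrapper (the exact implementation in the repository may differ):

// statusRecorder wraps http.ResponseWriter and records the status code
// written by the handler so the middleware can label its metrics.
type statusRecorder struct {
    http.ResponseWriter
    status int
}

func newStatusRecorder(w http.ResponseWriter) *statusRecorder {
    // Default to 200: handlers that never call WriteHeader implicitly return 200.
    return &statusRecorder{ResponseWriter: w, status: http.StatusOK}
}

func (r *statusRecorder) WriteHeader(code int) {
    r.status = code
    r.ResponseWriter.WriteHeader(code)
}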

Health Status Metrics Link to heading

For monitoring health, I implemented gauge metrics that reflect the readiness and liveness status:

func (m *metrics) updateHealthMetrics(ready, alive bool) {
    // Convert bool to float64 (1 for true, 0 for false)
    if ready {
        m.readinessGauge.Set(1)
    } else {
        m.readinessGauge.Set(0)
    }

    if alive {
        m.livenessGauge.Set(1)
    } else {
        m.livenessGauge.Set(0)
    }
}

These gauges provide immediate visibility into the controller’s health status, making it easy to set up alerts and dashboards that show when the service is unhealthy.
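
As a minimal sketch of how these gauges can be wired into the probe handlers (the handler and function names here are illustrative assumptions, not the repository's actual API):

// readyzHandler answers the readiness probe and keeps the gauge in sync.
func readyzHandler(m *metrics, isReady func() bool) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        ready := isReady()
        // Liveness is assumed true here: if this probe can be served, the process is alive.
        m.updateHealthMetrics(ready, true)
        if !ready {
            http.Error(w, "not ready", http.StatusServiceUnavailable)
            return
        }
        w.WriteHeader(http.StatusOK)
    }
}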

Testing Metrics Implementation Link to heading

One of my priorities was creating a comprehensive test suite for metrics. I’ve been burned before by unreliable metrics that looked fine in normal operation but failed to record important data in edge cases.

Unit Testing Metrics Link to heading

For basic functionality, I created unit tests that verify metric registration and updates:

func TestMetricsInitialization(t *testing.T) {
    tests := []struct {
        name     string
        registry *prometheus.Registry
        wantErr  bool
    }{
        {
            name:    "successful initialization with default registry",
            wantErr: false,
        },
        {
            name:     "initialization with custom registry",
            registry: prometheus.NewRegistry(),
            wantErr:  false,
        },
    }

    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            reg := tt.registry
            if reg == nil {
                reg = prometheus.NewRegistry()
            }

            m, err := initMetrics(reg)
            if tt.wantErr {
                assert.Error(t, err)
                assert.Nil(t, m)
                return
            }

            assert.NoError(t, err)
            assert.NotNil(t, m)

            // Verify metrics are registered
            metrics, err := reg.Gather()
            require.NoError(t, err)
            assert.NotEmpty(t, metrics, "Expected metrics to be registered")
        })
    }
}

Testing Edge Cases Link to heading

I was particularly concerned about edge cases, so I created tests that verify metrics are recorded correctly in challenging scenarios:

func TestMetricsMiddlewareEdgeCases(t *testing.T) {
    // Helper function to create a large string of specified size in KB
    createLargeString := func(sizeKB int) string {
        chunk := strings.Repeat("x", 1024) // 1KB chunk
        return strings.Repeat(chunk, sizeKB)
    }

    tests := []struct {
        name           string
        path           string
        method         string
        statusCode     int
        expectedLabels map[string]string
        requestBody    string
        responseBody   string
        handler        http.HandlerFunc
        sleep          time.Duration
    }{
        {
            name:       "panicking handler",
            path:       "/panic",
            method:     "POST",
            statusCode: http.StatusInternalServerError,
            handler: func(w http.ResponseWriter, r *http.Request) {
                panic("intentional panic for testing")
            },
        },
        {
            name:       "slow handler",
            path:       "/slow",
            method:     "GET",
            statusCode: http.StatusOK,
            sleep:      2 * time.Second,
            handler: func(w http.ResponseWriter, r *http.Request) {
                time.Sleep(2 * time.Second)
                w.WriteHeader(http.StatusOK)
            },
        },
        // More test cases...
    }
    
    // Test implementation...
}
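
The loop that drives these cases is elided above. As a rough sketch of how each case can be exercised against an isolated registry (this assumes testify, net/http/httptest, and the prometheus testutil package; the actual implementation in the repository may differ):

for _, tt := range tests {
    t.Run(tt.name, func(t *testing.T) {
        // Fresh registry per case so assertions only see this request's metrics
        reg := prometheus.NewRegistry()
        m, err := initMetrics(reg)
        require.NoError(t, err)

        srv := httptest.NewServer(m.metricsMiddleware(tt.handler))
        defer srv.Close()

        req, err := http.NewRequest(tt.method, srv.URL+tt.path, strings.NewReader(tt.requestBody))
        require.NoError(t, err)
        resp, err := srv.Client().Do(req)
        require.NoError(t, err)
        defer resp.Body.Close()
        assert.Equal(t, tt.statusCode, resp.StatusCode)

        // The request counter should be incremented exactly once with the expected labels
        count := testutil.ToFloat64(
            m.requestCounter.WithLabelValues(tt.path, tt.method, fmt.Sprintf("%d", tt.statusCode)),
        )
        assert.Equal(t, float64(1), count)
    })
}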

These tests helped me identify several subtle bugs in my initial implementation:

  1. I wasn’t properly recording the status code for panics
  2. Very slow requests weren’t being recorded in the correct histogram buckets
  3. There was a race condition in metrics recording for concurrent requests

Fixing these issues before deployment saved me from discovering them in production.

Exposing Metrics Securely Link to heading

Security was another key concern. Since I was already using TLS for my webhook endpoints, I decided to extend that to secure the metrics endpoint as well:

func (m *metrics) handler() http.Handler {
    if m.registry != nil {
        return promhttp.HandlerFor(m.registry, promhttp.HandlerOpts{})
    }
    return promhttp.Handler()
}
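
To serve this handler, the metrics endpoint is mounted on an HTTPS server. A minimal sketch, assuming a dedicated metrics port and certificate paths mounted into the pod (both are illustrative, not the repository's actual values):

// Serve /metrics over TLS, reusing the webhook's certificate material.
mux := http.NewServeMux()
mux.Handle("/metrics", m.handler())

metricsServer := &http.Server{
    Addr:    ":8443", // assumed metrics port
    Handler: mux,
}
if err := metricsServer.ListenAndServeTLS("/certs/tls.crt", "/certs/tls.key"); err != nil && err != http.ErrServerClosed {
    log.Fatal().Err(err).Msg("Metrics server failed")
}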

I set up the metrics endpoint with appropriate TLS configuration and documented the required ServiceMonitor configuration for secure scraping:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: pod-label-webhook
  namespace: webhook-test
spec:
  selector:
    matchLabels:
      app: pod-label-webhook
  namespaceSelector:
    matchNames:
      - webhook-test
  endpoints:
    - port: metrics
      scheme: https
      tlsConfig:
        ca:
          secret:
            name: webhook-metrics-cert
            key: ca.crt
        cert:
          secret:
            name: webhook-metrics-cert
            key: tls.crt
        keySecret:
          name: webhook-metrics-cert
          key: tls.key
      interval: 30s
      scrapeTimeout: 10s
      path: /metrics

This configuration ensures that only Prometheus instances presenting the client certificate from the webhook-metrics-cert secret can scrape the metrics, preventing unauthorized access to potentially sensitive data.

Visualizing Metrics with Grafana Link to heading

To make the metrics immediately useful, I created a Grafana dashboard:

{
  "panels": [
    {
      "datasource": {
        "type": "prometheus",
        "uid": "prometheus"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "reqps"
        },
        "overrides": []
      },
      "options": {
        "orientation": "auto",
        "reduceOptions": {
          "calcs": [
            "lastNotNull"
          ],
          "fields": "",
          "values": false
        },
        "showThresholdLabels": false,
        "showThresholdMarkers": true
      },
      "pluginVersion": "10.0.0",
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "prometheus"
          },
          "expr": "sum(rate(pod_label_webhook_requests_total[5m]))",
          "refId": "A"
        }
      ],
      "title": "Request Rate",
      "type": "gauge"
    },
    // More panels...
  ]
}

The dashboard includes these key panels:

  1. Request Rate gauge
  2. Request Duration (P95) time series
  3. Readiness Status indicator
  4. Error Rate by Path time series

This dashboard provides a comprehensive view of the controller’s performance and health.

Example PromQL Queries Link to heading

To help my team put the metrics to use right away, I documented some example PromQL queries:

Request Rate Link to heading

# Request rate over the last 5 minutes
rate(pod_label_webhook_requests_total[5m])

# Error rate over the last 5 minutes
rate(pod_label_webhook_errors_total[5m])

Latency Link to heading

# 95th percentile latency over the last hour
histogram_quantile(0.95, sum(rate(pod_label_webhook_request_duration_seconds_bucket[1h])) by (le))

# Average request duration
rate(pod_label_webhook_request_duration_seconds_sum[5m]) /
rate(pod_label_webhook_request_duration_seconds_count[5m])

Health Status Link to heading

# Current readiness status
pod_label_webhook_readiness_status

# Current liveness status
pod_label_webhook_liveness_status

I keep these queries handy in a team documentation page for quick reference during troubleshooting.

Setting Up Alerts Link to heading

Finally, I configured alerts for common failure scenarios:

groups:
  - name: pod-label-webhook
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(pod_label_webhook_errors_total[5m])) /
          sum(rate(pod_label_webhook_requests_total[5m])) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High error rate in pod-label-webhook
          description: Error rate is above 10% for the last 5 minutes

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(pod_label_webhook_request_duration_seconds_bucket[5m]))
            by (le)
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High latency in pod-label-webhook
          description: 95th percentile latency is above 1 second for the last 5 minutes

      - alert: WebhookNotReady
        expr: pod_label_webhook_readiness_status == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Pod label webhook is not ready
          description: Readiness probe has been failing for 5 minutes

These alerts have already saved me from several potential outages by providing early warning of issues before they became critical.

Lessons Learned Link to heading

Implementing this metrics system taught me several valuable lessons:

  1. Optimize Histogram Buckets: Default Prometheus histogram buckets are too coarse for webhook latencies. I found custom buckets provide much better visibility into the performance characteristics that actually matter for my service.

  2. Panic Recovery Matters: I learned the hard way in a previous project that ensuring metrics are recorded even when handlers panic is crucial for diagnosing issues. I made sure to include comprehensive panic recovery in my metrics middleware.

  3. Test Edge Cases: The most interesting metrics are often generated during abnormal conditions. Testing slow requests, large payloads, and error conditions revealed subtle bugs in my initial implementation that would have been difficult to catch in production.

  4. Label Cardinality: I initially wanted to include more labels for finer-grained metrics but realized this would cause cardinality explosion. Choosing the right labels (path, method, status) provides useful insights without overwhelming Prometheus.

  5. Security Integration: Integrating metrics with my existing TLS setup required extra work but ensured secure scraping without adding complexity to the deployment model.

Working with Claude on this implementation was particularly interesting. The AI excelled at generating the metrics infrastructure code and comprehensive tests, but I needed to guide the discussion around specific metrics choices and histogram bucket optimization based on my understanding of our production workloads.

Looking Forward Link to heading

With this metrics implementation in place, I now have deep visibility into my controller’s operation. In production, these metrics help me identify issues quickly, track performance over time, and ensure high availability.

In the next article, I’ll explore how I’ve been using AI as a code reviewer to identify potential issues, suggest optimizations, and improve the overall quality of my controller. This approach has been instrumental in refining the implementation beyond what was initially generated.

The complete metrics implementation can be found in my GitHub repository, along with the dashboard configuration and documentation.