Description
Is your feature request related to a problem?/Why is this needed
The EFS CSI driver currently does not expose Prometheus metrics for the volume provisioner controller. This makes it difficult to monitor, debug, and analyze provisioning performance or failures within Kubernetes environments.
In large-scale clusters or environments with dynamic provisioning patterns, the lack of visibility creates operational blind spots — for example, identifying slow provisions, tracking error rates, or understanding provisioning throughput over time.
Enhanced observability would significantly improve operational insights for SRE/Platform teams relying on the EFS CSI driver.
/feature
Describe the solution you'd like in detail
Add Prometheus metrics to the EFS CSI provisioner controller similar to what other CSI drivers expose. Ideally, metrics would cover:
- Total number of volume provisioning attempts
- Successful provision operations
- Failed provision operations
- Provisioning latency (histogram)
- Controller runtime health metrics
- Optional: per‑reason failure bucketing
- Optional: per‑access point or per‑filesystem metrics (if feasible)
Metrics should be registered through the standard Kubernetes/CSI instrumentation patterns, and exposed on a dedicated metrics endpoint so they can be scraped by Prometheus and shipped to monitoring solutions such as Grafana, Datadog, or CloudWatch Metrics via exporters.
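To make the requested metric shapes concrete, here is a minimal, language-agnostic sketch in Python using only the standard library. It is not the driver's code: the driver is written in Go, and a real implementation would most likely use the Go Prometheus client. The metric names (`efs_csi_provision_total`, `efs_csi_provision_duration_seconds`), the label names, the bucket boundaries, and the `timed_provision` helper are all hypothetical and only illustrate a counter with result/reason labels plus a latency histogram rendered in the Prometheus text exposition format.

```python
import time
from collections import defaultdict

# Hypothetical latency bucket upper bounds in seconds (an assumption, not taken from the driver).
BUCKETS = (0.5, 1.0, 2.5, 5.0, 10.0, 30.0)


class ProvisionMetrics:
    """Tracks provisioning outcomes and latency in a Prometheus-style shape."""

    def __init__(self, buckets=BUCKETS):
        self.buckets = tuple(sorted(buckets))
        self.results = defaultdict(int)  # (result, reason) -> count
        self.bucket_counts = [0] * len(self.buckets)
        self.latency_count = 0
        self.latency_sum = 0.0

    def observe(self, seconds, error_reason=None):
        """Record one provisioning attempt; error_reason=None means success."""
        result = "failure" if error_reason else "success"
        self.results[(result, error_reason or "")] += 1
        self.latency_count += 1
        self.latency_sum += seconds
        # Histogram buckets are cumulative: an observation counts toward
        # every bucket whose upper bound is >= the observed value.
        for i, bound in enumerate(self.buckets):
            if seconds <= bound:
                self.bucket_counts[i] += 1

    def render(self):
        """Render the metrics in the Prometheus text exposition format."""
        lines = ["# TYPE efs_csi_provision_total counter"]
        for (result, reason), n in sorted(self.results.items()):
            lines.append(
                f'efs_csi_provision_total{{result="{result}",reason="{reason}"}} {n}'
            )
        lines.append("# TYPE efs_csi_provision_duration_seconds histogram")
        for bound, n in zip(self.buckets, self.bucket_counts):
            lines.append(
                f'efs_csi_provision_duration_seconds_bucket{{le="{bound}"}} {n}'
            )
        lines.append(
            f'efs_csi_provision_duration_seconds_bucket{{le="+Inf"}} {self.latency_count}'
        )
        lines.append(f"efs_csi_provision_duration_seconds_sum {self.latency_sum}")
        lines.append(f"efs_csi_provision_duration_seconds_count {self.latency_count}")
        return "\n".join(lines) + "\n"


def timed_provision(metrics, provision_fn, clock=time.monotonic):
    """Hypothetical wrapper: time one provisioning call and record its outcome."""
    start = clock()
    try:
        result = provision_fn()
    except Exception as exc:
        # The exception type name stands in for a per-reason failure label.
        metrics.observe(clock() - start, error_reason=type(exc).__name__)
        raise
    metrics.observe(clock() - start)
    return result
```

In this sketch, total attempts are the sum of the counter across all `result` values, so a separate attempts counter is not strictly needed. Keeping `reason` to a small, fixed set of values and treating per-access-point or per-filesystem labels as opt-in would help limit label cardinality.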
Describe alternatives you've considered
- External logging analysis: Currently the only option is to parse controller logs and infer provisioning performance or errors, which is unreliable and not suitable for real‑time monitoring.
- AWS CloudWatch EFS metrics: These provide filesystem‑level insights but do not capture Kubernetes dynamic provisioning lifecycle events or CSI controller performance.
- Sidecar instrumentation: Adding a proxy/sidecar to scrape logs and generate synthetic metrics, but this increases operational complexity and is still less accurate than native instrumentation.
None of the alternatives provide the precision, reliability, or ease of integration that native Prometheus metrics would.
Additional context
- Many other CSI drivers (e.g., EBS CSI, GCE PD CSI, Azure Disk CSI) already expose helpful Prometheus metrics for their controllers. Aligning EFS CSI with that ecosystem would improve observability consistency across clusters.
- This feature would help SRE teams establish SLOs for provisioning and improve debugging during outages or performance regressions.
- Happy to help test or validate the implementation if needed.