Controller metrics for EFS volumes provisioning #1787

@imunhatep

Is your feature request related to a problem?/Why is this needed

The EFS CSI driver currently does not expose Prometheus metrics for the volume provisioner controller. This makes it difficult to monitor, debug, and analyze provisioning performance or failures within Kubernetes environments.
In large-scale clusters or environments with dynamic provisioning patterns, this lack of visibility creates operational blind spots: it is hard to identify slow provisions, track error rates, or understand provisioning throughput over time.
Enhanced observability would significantly improve operational insights for SRE/Platform teams relying on the EFS CSI driver.

/feature

Describe the solution you'd like in detail
Add Prometheus metrics to the EFS CSI provisioner controller similar to what other CSI drivers expose. Ideally, metrics would cover:

  • Total number of volume provisioning attempts
  • Successful provision operations
  • Failed provision operations
  • Provisioning latency (histogram)
  • Controller runtime health metrics
  • Optional: per‑reason failure bucketing
  • Optional: per‑access point or per‑filesystem metrics (if feasible)
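
To make the list above concrete, here is a minimal, language-neutral sketch of how those metrics could be shaped, written in stdlib-only Python purely for illustration (the driver itself is written in Go and would more likely use an existing Prometheus client library). The metric names (`efs_csi_provision_*`), label names, bucket boundaries, and the `ProvisionMetrics` class are all hypothetical and not taken from the driver. Total attempts is the sum of the `result` series.

```python
import bisect
import threading
from collections import defaultdict

# Hypothetical histogram bucket upper bounds, in seconds (an assumption).
BUCKETS = (0.5, 1.0, 2.5, 5.0, 10.0, 30.0)
# Hypothetical metric name prefix (an assumption, not an existing driver metric).
PREFIX = "efs_csi_provision"


class ProvisionMetrics:
    """In-memory counters and a latency histogram for provisioning calls."""

    def __init__(self):
        self._lock = threading.Lock()
        self._results = defaultdict(int)   # "success" / "failure" -> count
        self._failures = defaultdict(int)  # failure reason -> count
        self._bucket_counts = [0] * len(BUCKETS)  # per-bucket, not cumulative
        self._sum = 0.0
        self._count = 0

    def observe(self, seconds, error_reason=None):
        """Record one provisioning attempt; error_reason=None means success."""
        with self._lock:
            result = "success" if error_reason is None else "failure"
            self._results[result] += 1
            if error_reason is not None:
                self._failures[error_reason] += 1
            # First bucket whose upper bound is >= the observed latency.
            idx = bisect.bisect_left(BUCKETS, seconds)
            if idx < len(BUCKETS):
                self._bucket_counts[idx] += 1
            self._sum += seconds
            self._count += 1

    def render(self):
        """Return all metrics in the Prometheus text exposition format.

        Label values are assumed to be simple identifiers; no escaping is done.
        """
        with self._lock:
            lines = [f"# TYPE {PREFIX}_operations_total counter"]
            for result in sorted(self._results):
                lines.append(
                    f'{PREFIX}_operations_total{{result="{result}"}} {self._results[result]}'
                )
            lines.append(f"# TYPE {PREFIX}_failures_total counter")
            for reason in sorted(self._failures):
                lines.append(
                    f'{PREFIX}_failures_total{{reason="{reason}"}} {self._failures[reason]}'
                )
            lines.append(f"# TYPE {PREFIX}_duration_seconds histogram")
            cumulative = 0
            for bound, n in zip(BUCKETS, self._bucket_counts):
                cumulative += n  # histogram buckets are cumulative on output
                lines.append(
                    f'{PREFIX}_duration_seconds_bucket{{le="{bound}"}} {cumulative}'
                )
            lines.append(f'{PREFIX}_duration_seconds_bucket{{le="+Inf"}} {self._count}')
            lines.append(f"{PREFIX}_duration_seconds_sum {self._sum}")
            lines.append(f"{PREFIX}_duration_seconds_count {self._count}")
            return "\n".join(lines) + "\n"


if __name__ == "__main__":
    metrics = ProvisionMetrics()
    metrics.observe(0.8)                           # a successful provision
    metrics.observe(12.0, error_reason="Timeout")  # hypothetical failure reason
    print(metrics.render())
```

Keeping the failure reason as a label on a separate counter keeps cardinality bounded as long as reasons come from a fixed set. Per-access-point or per-filesystem labels would need more care, since their value space grows with the cluster.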

Metrics should be registered through the standard Kubernetes/CSI instrumentation patterns, and exposed on a dedicated metrics endpoint so they can be scraped by Prometheus and shipped to monitoring solutions such as Grafana, Datadog, or CloudWatch Metrics via exporters.
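
As a rough illustration of the "dedicated metrics endpoint" idea, the following stdlib-only Python sketch serves a text payload at `/metrics` on its own port. It is not how the driver (written in Go) would implement it; the helper names `make_metrics_handler` and `start_metrics_server`, the bind address, and the sample metric line are assumptions made for this example.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def make_metrics_handler(render):
    """Build a handler class that serves render() at /metrics."""

    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/metrics":
                self.send_error(404)
                return
            body = render().encode("utf-8")
            self.send_response(200)
            # Content type used for the Prometheus text exposition format.
            self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, fmt, *args):
            pass  # keep the sketch quiet; a real server would log requests

    return MetricsHandler


def start_metrics_server(render, host="127.0.0.1", port=0):
    """Start the endpoint in a background thread; port=0 picks a free port."""
    server = ThreadingHTTPServer((host, port), make_metrics_handler(render))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


if __name__ == "__main__":
    # Hypothetical payload; in practice render() would come from the metrics registry.
    server = start_metrics_server(lambda: "efs_csi_provision_operations_total 0\n")
    conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1])
    conn.request("GET", "/metrics")
    print(conn.getresponse().read().decode())
    conn.close()
    server.shutdown()
    server.server_close()
```

Serving metrics on a separate port from the CSI gRPC socket lets Prometheus scrape the controller without touching the storage control path.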

Describe alternatives you've considered

  • External logging analysis: Currently the only option is to parse controller logs and infer provisioning performance or errors, which is unreliable and not suitable for real‑time monitoring.
  • AWS CloudWatch EFS metrics: These provide filesystem‑level insights but do not capture Kubernetes dynamic provisioning lifecycle events or CSI controller performance.
  • Sidecar instrumentation: Adding a proxy/sidecar to scrape logs and generate synthetic metrics is possible, but it increases operational complexity and is still less accurate than native instrumentation.

None of the alternatives provide the precision, reliability, or ease of integration that native Prometheus metrics would.

Additional context

  • Many other CSI drivers (e.g., EBS CSI, GCE PD CSI, Azure Disk CSI) already expose helpful Prometheus metrics for their controllers. Aligning EFS CSI with that ecosystem would improve observability consistency across clusters.
  • This feature would help SRE teams establish SLOs for provisioning and improve debugging during outages or performance regressions.
  • Happy to help test or validate the implementation if needed.
