Cascading failures in resource List handlers cause cleanup evasion #907

@aws-khargita

Summary

Following the fix for CloudWatchLogsLogGroup in #842, an audit of all resource handlers in the repository found that ~90 of them exhibit the same cascading failure pattern. When a per-resource enrichment API call (e.g., ListTagsForResource or DescribeCluster) fails for a single resource inside a List function, the entire listing aborts with return nil, err, so every resource of that type silently evades cleanup.

The Bug Pattern

Inside List methods, after the main listing/pagination call succeeds, there are often secondary API calls made per resource to enrich individual items with tags, descriptions, or other metadata. When these per-resource calls use return nil, err on failure, it causes a cascading failure:

// ❌ BUG: one tag failure kills discovery of ALL resources
for _, item := range resp.Items {
    tags, err := svc.ListTagsForResource(ctx, &svc.ListTagsForResourceInput{
        ResourceArn: item.Arn,
    })
    if err != nil {
        return nil, err // <-- ALL resources lost
    }
    resources = append(resources, &MyResource{Tags: tags})
}

The fix is to log a warning and continue, skipping only the problematic resource:

// ✅ FIX: skip the problematic resource, continue discovering others
for _, item := range resp.Items {
    tags, err := svc.ListTagsForResource(ctx, &svc.ListTagsForResourceInput{
        ResourceArn: item.Arn,
    })
    if err != nil {
        logrus.WithError(err).WithField("arn", *item.Arn).
            Warn("unable to list tags, skipping resource to avoid incorrect filtering")
        continue
    }
    resources = append(resources, &MyResource{Tags: tags})
}

Note: The main listing/pagination call itself (e.g., paginator.NextPage) should still return nil, err — that's correct behavior. Only the secondary per-resource calls inside the loop need this fix.

Real-World Impact

This was originally discovered via an SCP blocking ListTagsForResource on CloudWatch Log Groups (#842). Any similar SCP, permission boundary, or transient API error on a per-resource enrichment call will cause the same problem for any of the ~90 affected resources listed below.

Affected Resources

Category 1: Tag Fetching Failures (ListTagsForResource / ListTags / DescribeTags)

| Resource File | Buggy Call |
| --- | --- |
| cloudfront-distribution.go | ListTagsForResource per distribution |
| cloudwatch-alarm.go | ListTagsForResource per alarm (2 occurrences: metric + composite) |
| ecr-public-repository.go | ListTagsForResource per repo |
| ecr-repository.go | ListTagsForResource per repo |
| efs-filesystem.go | ListTagsForResource per filesystem |
| efs-mount-targets.go | ListTagsForResource per filesystem |
| elasticsearchservice-domain.go | ListTags per domain |
| elb-elb.go | DescribeTags in batches of 20 |
| elbv2-alb.go | DescribeTags in batches of 20 |
| elbv2-targetgroup.go | DescribeTags in batches of 20 |
| iotsitewise-asset-model.go | ListTagsForResource per model |
| iotsitewise-asset.go | ListTagsForResource per asset |
| iottwinmaker-component-type.go | ListTagsForResource per type |
| iottwinmaker-workspace.go | ListTagsForResource per workspace |
| neptune-graph.go | ListTagsForResource per graph |
| opensearchservice-domain.go | ListTags per domain |
| rds-snapshots.go | ListTagsForResource per snapshot |
| rds-cluster-snapshots.go | ListTagsForResource per snapshot |
| route53-health-checks.go | ListTagsForResource per check |
| route53-hosted-zone.go | ListTagsForResource per zone |
| shield-protection.go | ListTagsForResource per protection |
| shield-protection-group.go | ListTagsForResource per group |
| ssm-parameters.go | ListTagsForResource per parameter |
| bedrock-custom-models.go | ListTagsForResource per model |
| bedrock-evaluation-jobs.go | ListTagsForResource per job |
| bedrock-guardrails.go | ListTagsForResource per guardrail |
| bedrock-model-customization-jobs.go | ListTagsForResource per job |
| bedrock-provisioned-model-throughputs.go | ListTagsForResource per throughput |
| acm-certificate.go | ListTagsForCertificate per cert |
| acm-pca-certificate-authority.go | ListTags per CA |
| acm-pca-certificate-authority-state.go | ListTags per CA |
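
The DescribeTags entries above differ from the rest because one call covers a batch of up to 20 resources. One possible way to apply the same idea there, sketched under assumptions, is to skip only the failed batch. The names tagSet, describeTagsFunc, tagsInBatches, makeARNs, and flakyDescribe are hypothetical and do not come from the repository or the SDK; the standard library log package stands in for logrus.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// batchSize mirrors the "batches of 20" noted for the ELB handlers.
const batchSize = 20

// tagSet and describeTagsFunc are hypothetical stand-ins for the SDK's tag
// output and its batched DescribeTags call.
type tagSet map[string]string
type describeTagsFunc func(arns []string) (map[string]tagSet, error)

// tagsInBatches looks up tags batch by batch. A failed batch is logged and
// skipped, so ARNs in other batches still get their tags.
func tagsInBatches(arns []string, describe describeTagsFunc) map[string]tagSet {
	out := make(map[string]tagSet, len(arns))
	for start := 0; start < len(arns); start += batchSize {
		end := start + batchSize
		if end > len(arns) {
			end = len(arns)
		}
		tags, err := describe(arns[start:end])
		if err != nil {
			log.Printf("unable to describe tags for ARNs %d-%d, skipping batch: %v", start, end-1, err)
			continue
		}
		for arn, t := range tags {
			out[arn] = t
		}
	}
	return out
}

// makeARNs returns n placeholder ARNs: arn-0, arn-1, and so on.
func makeARNs(n int) []string {
	arns := make([]string, n)
	for i := range arns {
		arns[i] = fmt.Sprintf("arn-%d", i)
	}
	return arns
}

// flakyDescribe fails any batch that contains arn-25.
func flakyDescribe(arns []string) (map[string]tagSet, error) {
	out := make(map[string]tagSet, len(arns))
	for _, arn := range arns {
		if arn == "arn-25" {
			return nil, errors.New("throttled")
		}
		out[arn] = tagSet{"env": "test"}
	}
	return out, nil
}

func main() {
	tags := tagsInBatches(makeARNs(45), flakyDescribe)
	fmt.Println(len(tags)) // prints 25: 45 ARNs minus the failed batch of 20
}
```

In this sketch a caller would then leave out any resource whose ARN is missing from the returned map, rather than emitting it without tags, so that tag-based filters are never evaluated against incomplete data.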

Category 2: Describe/Get Enrichment Failures

| Resource File | Buggy Call |
| --- | --- |
| acm-certificate.go | DescribeCertificate per cert |
| codeartifact-domains.go | DescribeDomain per domain |
| codestar-notifications.go | DescribeNotificationRule per rule |
| dsql-cluster.go | GetCluster per cluster |
| dynamodb-item.go | DescribeTable + Scan per table |
| eks-clusters.go | DescribeCluster per cluster |
| eks-nodegroups.go | DescribeNodegroup per nodegroup |
| elasticsearchservice-domain.go | DescribeElasticsearchDomain per domain |
| neptune-graph.go | ListGraphSnapshots per graph |
| opensearchservice-domain.go | DescribeDomainConfig per domain |
| qldb-ledger.go | DescribeLedger per ledger |
| textract-adapters.go | GetAdapter per adapter |
| transfer-server.go | DescribeServer per server |
| transfer-server-user.go | DescribeUser per user |
| waf-rules.go | GetRule per rule |
| waf-webacl-rule-attachments.go | GetWebACL per ACL |
| bedrockagentcorecontrol-gateway.go | GetGateway per gateway |
| bedrockagentcorecontrol-workloadidentity.go | GetWorkloadIdentity per identity |
| iot-thinggroups.go | DescribeThingGroup per group |
| cognito-userpool-domain.go | DescribeUserPool per pool |

Category 3: Nested Listing Failures (list sub-resources per parent)

| Resource File | Buggy Call |
| --- | --- |
| cloudwatchevents-rule.go | ListRules per event bus |
| cloudwatchevents-target.go | ListRules per bus + ListTargetsByRule per rule |
| codedeploy-deployment-group.go | ListDeploymentGroups per application |
| cognito-identity-provider.go | ListIdentityProviders per user pool |
| cognito-userpool-client.go | ListUserPoolClients per user pool |
| ec2-client-vpn-endpoint-attachment.go | DescribeClientVpnTargetNetworks per endpoint |
| ec2-internet-gateway-attachment.go | DescribeInternetGateways per VPC |
| ec2-vpc-endpoint.go | DescribeVpcEndpoints per VPC |
| ec2-vpn-gateway-attachments.go | DescribeVpnGateways per VPC |
| ecs-clusterinstances.go | ListContainerInstances per cluster |
| ecs-services.go | ListServices per cluster |
| ecs-task.go | ListTasks per cluster |
| efs-mount-targets.go | DescribeMountTargets per filesystem |
| eks-nodegroups.go | ListNodegroups per cluster |
| ga-endpoints.go | ListListeners per accelerator + ListEndpointGroups per listener |
| ga-listeners.go | ListListeners per accelerator |
| iam-group-policies.go | ListGroupPolicies per group |
| iam-group-policy-attachments.go | ListAttachedGroupPolicies per group |
| iam-user-access-key.go | ListAccessKeys + ListUserTags per user |
| iam-user-group-attachments.go | ListGroupsForUser per user |
| iam-user-https-git-credential.go | ListServiceSpecificCredentials + ListUserTags per user |
| iam-user-mfa-device.go | ListMFADevices per user |
| iam-user-policy.go | ListUserPolicies per user |
| iam-user-policy-attachment.go | ListAttachedUserPolicies per user |
| iam-user-ssh-keys.go | ListSSHPublicKeys per user |
| iam-signing-certificate.go | ListSigningCertificates per user |
| iam-service-specific-credentials.go | ListServiceSpecificCredentials per user |
| imagebuilder-components.go | ListComponentBuildVersions per component |
| imagebuilder-images.go | ListImageBuildVersions per image |
| iot-policies.go | ListTargetsForPolicy + ListPolicyVersions per policy |
| iot-things.go | ListThingPrincipals per thing |
| lambda-layers.go | ListLayerVersionsPages per layer |
| managedblockchain-member.go | ListMembers per network |
| mediastoredata-items.go | ListItems per container |
| opsworks-apps.go | DescribeApps per stack |
| opsworks-instances.go | DescribeInstances per stack |
| opsworks-layers.go | DescribeLayers per stack |
| route53-resource-record.go | ListResourceRecordsForZone per zone |
| route53-traffic-policies.go | instancesForPolicy per policy |
| s3-multipart-upload.go | ListMultipartUploads per bucket |
| s3-object.go | ListObjectVersions per bucket |
| servicecatalog-portfolio-constraints-attachments.go | ListConstraintsForPortfolio per portfolio |
| servicecatalog-portfolio-principal-attachments.go | ListPrincipalsForPortfolio per portfolio |
| servicecatalog-portfolio-product-attachments.go | ListPortfoliosForProduct per product |
| servicecatalog-portfolio-share-attachments.go | ListPortfolioAccess per portfolio |
| servicecatalog-portfolio-tagoptions-attachements.go | ListResourcesForTagOption per tag option |
| servicediscovery-instances.go | ListInstances per service |
| sns-endpoints.go | ListEndpointsByPlatformApplication per app |
| textract-adapter-versions.go | ListAdapterVersions per adapter |
| transfer-server-user.go | ListUsers per server |
| appconfig-configurationprofiles.go | ListConfigurationProfiles per app |
| appconfig-environments.go | ListEnvironments per app |
| appconfig-hostedconfigurationversions.go | ListHostedConfigurationVersions per profile |
| appmesh-virtualgateway.go | ListVirtualGateways per mesh |
| appmesh-virtualnode.go | ListVirtualNodes per mesh |
| appmesh-virtualrouter.go | ListVirtualRouters per mesh |
| appmesh-virtualservice.go | ListVirtualServices per mesh |
| appmesh-gatewayroute.go | ListVirtualGateways per mesh + ListGatewayRoutes per gateway |
| appmesh-route.go | ListVirtualRouters per mesh + ListRoutes per router |
| appstream-stack-fleet-attachments.go | ListAssociatedFleets per stack |
| appsync-api-association.go | GetApiAssociation per domain |
| athena-named-query.go | ListNamedQueries per workgroup |
| athena-prepared-statement.go | ListPreparedStatements per workgroup |
| autoscaling-lifecycle-hook.go | DescribeLifecycleHooks per ASG |
| backup-vaults-access-policies.go | GetBackupVaultAccessPolicy per vault |
| bedrock-agent-alias.go | ListAgentAliases per agent |
| bedrock-agent-datasource.go | ListDataSources per knowledge base |
| bedrock-flow-alias.go | ListFlowAliases per flow |
| ses-receiptrulesets.go | DescribeActiveReceiptRuleSet per ruleset |
| wafregional-byte-match-set-tuples.go | GetByteMatchSet per set |
| wafregional-ip-set-ips.go | GetIPSet per set |
| wafregional-rate-based-rule-predicates.go | GetRateBasedRule per rule |
| wafregional-regex-match-tuples.go | GetRegexMatchSet per set |
| wafregional-regex-pattern-tuples.go | GetRegexPatternSet per set |
| wafregional-rule-predicates.go | GetRule per rule |
| wafregional-rules.go | GetRule per rule |
| wafregional-webacl-rule-attachments.go | GetWebACL per ACL |

Resources That Already Handle This Correctly (for reference)

These resources demonstrate the correct pattern and can be used as examples:

  • cloudwatchlogs-loggroup.go: fixed in #842 ("CloudWatchLogsLogGroup resource type has cascading failures in the discovery process when API calls for single log group fails"), uses logrus.Warn + continue
  • rds-clusters.go, rds-instances.go, rds-dbparametergroups.go, etc.: ListTagsForResource failures use continue
  • lambda-function.go: uses continue for tag failures
  • iam-role.go, iam-policy.go, iam-user.go: use continue for enrichment failures
  • s3-bucket.go: uses logrus + continue for tag failures
  • dynamodb-table.go: uses logrus.Warn + continue
  • eks-fargate-profile.go: uses logrus.Error + continue
  • memorydb-*.go: all use continue for ListTags failures
  • neptune-cluster.go, neptune-instance.go: log a warning and continue
  • sns-topics.go, sqs-queues.go: use continue or logrus for tag failures

Note: The handlers above log at different levels (e.g., logrus.Warn vs. logrus.Error), so it may be worth aligning on a single log level across resources for consistency.

Suggested Fix Approach

  1. For each affected file, change return nil, err to logrus.WithError(err).Warn(...) + continue on per-resource enrichment calls
  2. Include the resource identifier (ARN, name, ID) in the warning log for debuggability
  3. Use a consistent warning message format: "unable to <action> for <resource type>, skipping to avoid incorrect filtering"
  4. Add mock tests for the error path to prevent regression

This could be done incrementally per-service or in bulk. The fix is mechanical and consistent across all affected files.

When I have the bandwidth I don't mind tackling this, just wanted to put the issue up to track it!
