Closing the loop on cloud security remediation using GenAI

Aman Bansal
3 days ago

Most cloud security programs today have detection figured out and remediation unsolved. The customers we work with have AWS Security Hub aggregating findings from Amazon GuardDuty, Amazon Inspector, Amazon Macie, AWS Config, AWS IAM Access Analyzer, and third-party Cloud Security Posture Management (CSPM) tools. The dashboards are full. What is missing is a reliable, repeatable path from a finding to a durable fix, with evidence that the fix worked and assurance that the same issue will not reappear next quarter.

In this post, we describe a reference architecture that closes this loop. It uses AWS-native services for aggregation, routing, and runtime remediation, adds Amazon Bedrock for generative-AI-powered Infrastructure as Code (IaC) fix generation, and feeds persistent findings back into preventive controls such as Service Control Policies (SCPs), AWS Control Tower guardrails, and pre-commit IaC scans. The pattern works regardless of the size of your security organization, and we show how the implementation sophistication scales with your needs.

By the end of this post, you will be able to:

Diagnose why your current remediation pipeline is leaking findings
Map a nine-stage closed-loop remediation pattern onto AWS services
Decide which stages to build first based on your organization's current state
Decouple verification from discovery tooling so the loop is resilient to CSPM vendor changes
Operate the feedback loop that converts repeated remediation into permanent prevention

This is a conceptual reference. We focus on the design decisions and tradeoffs; we link to a companion implementation for code-level detail.

Why remediation fails

In customer engagements across company sizes, four failure modes consistently block remediation regardless of detection tooling:

Ownership gap: the security team sees the finding, but they do not own the resource. The team that does own the resource does not know about the finding.
Surface-level fixes: the live resource gets patched, but the Terraform module or CloudFormation template that created it still has the misconfiguration. The next deployment recreates the vulnerability.
Triage overload: every finding is marked high severity, which means nothing is actually prioritized. The security operations team spends more time classifying than fixing.
Open-ended findings: there is no forcing function to close or formally accept a finding. Findings accumulate in Security Hub indefinitely, muddying the signal.

These are operational and organizational gaps. Tooling alone cannot close them. The pattern below addresses all four failure modes by design.

The closed-loop pattern

The core idea is simple: every finding should travel a predetermined path that ends in one of three destinies:

Fixed: the resource is back in compliance, verified by an independent check
Accepted with exception: a documented business decision, with an expiration date
Prevented from recurring: the pattern is now blocked at the policy layer, making the finding impossible to reproduce

The loop has nine stages:

Detect -> Aggregate -> Route -> Triage -> Remediate (runtime) -> Remediate (IaC) -> Notify -> Verify -> Prevent
   ^                                                                                                      |
   +------------------------------------------------------------------------------------------------------+

The arrow from Prevent back to Detect is the closure. When the prevention layer blocks the vulnerability at deploy time, the detection layer stops surfacing it. That is what closed-loop means in this context.

Three invariants hold across all nine stages:

Every finding has a destiny: state transitions follow defined paths, never a default of "open forever"
Ownership is data, not tribal knowledge: a queryable table maps resources to owners
Verification is independent of discovery: the tool that found the finding is not the tool that closes it

Solution overview

The architecture maps the nine stages onto AWS services. Third-party CSPM tools (Wiz, Prisma Cloud, Orca, and others) integrate through Security Hub's native third-party product integration.

The nine stages and their service mappings:

Stage	Job	AWS services
1. Detect	Generate findings	GuardDuty, Inspector, Macie, Config, IAM Access Analyzer, third-party CSPM, Checkov in CI
2. Aggregate	Single finding store	Security Hub with Automation Rules
3. Route	Dispatch by severity and type	Amazon EventBridge
4. Triage	Decide owner, auto vs. human, runtime vs. IaC	AWS Lambda + Amazon DynamoDB
5. Remediate (runtime)	Fix the live resource	AWS Systems Manager Automation, Lambda, AWS Step Functions
6. Remediate (IaC)	Fix the source code	Lambda + Amazon Bedrock + Git provider API
7. Notify	Inform owners, create tickets	Amazon SNS, Lambda (for Jira/ADO)
8. Verify	Independent confirmation	Lambda + AWS SDK + Security Hub BatchUpdateFindings
9. Prevent	Block at policy layer	Service Control Policies, Control Tower guardrails, Checkov custom rules

The next sections walk through each stage at the design level. We focus on the four stages most often misimplemented: triage (Stage 4), IaC remediation (Stage 6), verification (Stage 8), and prevention (Stage 9). Code-level detail for the IaC remediation pipeline lives in a companion reference implementation, called out where relevant.

Stage 1: Detection

This stage requires no new build work. Customers already have detection coverage. The architecture's only requirement is that every finding source lands in Security Hub.

AWS-native services integrate automatically when enabled in the same AWS account and Region. GuardDuty auto-publishes findings to Security Hub when both services are enabled; Inspector auto-publishes when the Security Hub integration is enabled in Inspector settings; Macie, Config, and IAM Access Analyzer behave similarly. For third-party tools, Security Hub offers native integrations for over 40 partner products. Custom or proprietary scanners ingest through BatchImportFindings, either directly from the vendor or through a thin Lambda adapter that maps the source format to AWS Security Finding Format (ASFF).
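The thin adapter mentioned above can be sketched as a pure mapping function. This is a minimal illustration, not the companion implementation: the input record layout (id, rule, severity, title, description, resource_arn, resource_type) is a hypothetical scanner format, and only the commonly required ASFF fields are populated.

```python
from datetime import datetime, timezone


def to_asff(record: dict, account_id: str, region: str) -> dict:
    """Map a hypothetical scanner record to a minimal ASFF finding."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "SchemaVersion": "2018-10-08",
        # Prefix is an assumption; any stable, unique identifier works.
        "Id": f"custom-scanner/{record['id']}",
        # Default product ARN for findings imported by the account itself.
        "ProductArn": (
            f"arn:aws:securityhub:{region}:{account_id}:"
            f"product/{account_id}/default"
        ),
        "GeneratorId": record["rule"],
        "AwsAccountId": account_id,
        "Types": ["Software and Configuration Checks"],
        "CreatedAt": record.get("first_seen", now),
        "UpdatedAt": now,
        "Severity": {"Label": record["severity"].upper()},
        "Title": record["title"],
        "Description": record["description"],
        "Resources": [
            {
                "Type": record.get("resource_type", "Other"),
                "Id": record["resource_arn"],
            }
        ],
    }
```

A Lambda adapter would pass a list of these dictionaries as the Findings parameter of BatchImportFindings.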

Stage 2: Aggregation with Security Hub Automation Rules

Security Hub Automation Rules enrich findings before they fan out. Use them to escalate severity based on resource tags (production resources upgrade to Critical), suppress known false positives, and add custom tags that downstream stages use for routing.

The mental model is simple: rules match on finding shape (resource type, severity, tags) and apply field updates (severity adjustment, workflow status, custom labels). A typical first-week rule escalates MEDIUM findings on production-tagged resources to HIGH, which lets downstream EventBridge filters use a single severity check rather than juggling tag conditions.
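The first-week rule described above might look like the following request body. This is a sketch under assumptions: the rule name, the Environment=production tag, and the note text are illustrative, and the field shapes should be checked against the CreateAutomationRule API reference before use.

```python
def production_escalation_rule() -> dict:
    """Build a request body that escalates MEDIUM to HIGH on production resources."""
    return {
        "RuleName": "escalate-medium-on-production",  # hypothetical name
        "RuleOrder": 1,
        "RuleStatus": "ENABLED",
        "Description": "Escalate MEDIUM findings on production-tagged resources",
        "IsTerminal": False,
        # Criteria: the finding shape the rule matches on.
        "Criteria": {
            "SeverityLabel": [{"Value": "MEDIUM", "Comparison": "EQUALS"}],
            "ResourceTags": [
                {"Key": "Environment", "Value": "production", "Comparison": "EQUALS"}
            ],
        },
        # Actions: the field updates applied to matching findings.
        "Actions": [
            {
                "Type": "FINDING_FIELDS_UPDATE",
                "FindingFieldsUpdate": {
                    "Severity": {"Label": "HIGH"},
                    "Note": {
                        "Text": "Escalated: production resource",
                        "UpdatedBy": "automation-rule",
                    },
                },
            }
        ],
    }
```

With boto3, the dictionary can be passed as keyword arguments to the Security Hub client's create_automation_rule call.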

Stage 3: Routing with EventBridge

Security Hub emits all findings to the default EventBridge bus. Write EventBridge rules that match on severity, resource type, and account, and route to different targets. The rule shape is small and worth seeing once:

{
  "source": ["aws.securityhub"],
  "detail-type": ["Security Hub Findings - Imported"],
  "detail": {
    "findings": {
      "Severity": { "Label": ["CRITICAL", "HIGH"] },
      "Workflow": { "Status": ["NEW"] }
    }
  }
}

This pattern targets the triage Lambda function described in Stage 4. Authoring one rule per route (auto-remediable, requires-approval, human-only) keeps the routing layer simple and auditable.

Stage 4: Triage and notification

The triage Lambda function answers three questions for each finding:

Who owns this resource?
Is the finding auto-remediable, or does it need human judgment?
Is the right fix a runtime change or an IaC change?

Ownership is stored in DynamoDB keyed by tag pattern (for example, Environment=production,Team=platform maps to the platform-eng team). Tag shapes differ by source: ASFF documents Resources[].Tags as a flat map of tag key to value, while many AWS tagging APIs return a list of {Key, Value} records. The lookup must normalize whichever shape it receives before reconstructing the pattern key. This is a small detail that bites teams that copy a quick example.
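The pattern-key reconstruction can be sketched as follows. The sketch accepts tags either as a list of {Key, Value} records or as a flat mapping, since both shapes appear across AWS APIs. The OWNERS dictionary stands in for the DynamoDB table, and its contents, the key order, and the default owner are all hypothetical.

```python
PATTERN_KEYS = ("Environment", "Team")  # assumed order of the pattern key
DEFAULT_OWNER = "platform-eng"          # assumed default team for unowned resources

# Stand-in for the DynamoDB ownership table.
OWNERS = {"Environment=production,Team=platform": "platform-eng"}


def normalize_tags(tags) -> dict:
    """Accept a list of {Key, Value} records or a flat mapping."""
    if isinstance(tags, dict):
        return dict(tags)
    return {t["Key"]: t["Value"] for t in tags or []}


def pattern_key(tags, keys=PATTERN_KEYS) -> str:
    """Rebuild 'Environment=...,Team=...' from whichever tags are present."""
    flat = normalize_tags(tags)
    return ",".join(f"{k}={flat[k]}" for k in keys if k in flat)


def lookup_owner(finding: dict, owners=OWNERS) -> str:
    """Return the owner of the first resource whose tag pattern is known."""
    for resource in finding.get("Resources", []):
        key = pattern_key(resource.get("Tags"))
        if key in owners:
            return owners[key]
    return DEFAULT_OWNER
```

Keeping the key order fixed in one place avoids mismatches between how the table was populated and how the Lambda function queries it.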

Auto-remediable is a configuration table: per finding type, a flag for whether automated remediation runs without human approval, and a path indicating runtime versus IaC. Customers we work with start with a tight allowlist (a few well-understood finding types) and expand the table as confidence grows.

Routing then dispatches: runtime-eligible findings go to a Step Functions state machine that calls Systems Manager Automation; IaC-eligible findings go to the Bedrock-powered fix generator (Stage 6); everything else goes to the human-review path.
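The configuration table and dispatch decision can be sketched as a small lookup. The finding types, flags, and route names below are illustrative assumptions; in the architecture described here the table would live in DynamoDB.

```python
# Hypothetical allowlist: finding type -> (auto-remediable?, fix path).
REMEDIATION_CONFIG = {
    "S3.MissingEncryption": {"auto": True, "path": "runtime"},
    "S3.PublicAccess": {"auto": True, "path": "iac"},
    "IAM.OverlyPermissivePolicy": {"auto": False, "path": "iac"},
}

ROUTES = {"runtime": "runtime-remediation", "iac": "iac-fix-generator"}


def choose_route(finding_type: str, config=REMEDIATION_CONFIG) -> str:
    """Pick a route; anything unknown or not allowlisted goes to humans."""
    entry = config.get(finding_type)
    if entry is None or not entry["auto"]:
        return "human-review"
    return ROUTES.get(entry["path"], "human-review")
```

Defaulting to human review keeps the allowlist tight: a finding type is only automated once someone has added it to the table deliberately.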

Notification (Stage 7 in the nine-stage model) is implemented as part of triage. The triage Lambda function publishes to an SNS topic regardless of the path. The topic fans out to Slack or Microsoft Teams for awareness and to a ticket-creation Lambda function for Jira or Azure DevOps. The ticket records what happened for humans to review; it is an artifact of the workflow, not the workflow itself. The workflow is the Step Functions state machine plus the Bedrock fix path.

A pattern worth borrowing: when ownership lookup returns unassigned, route to a single default team (typically platform engineering) rather than a queue no one monitors. This creates organizational pressure to claim ownership, which keeps the ownership table current.

Stage 5: Runtime remediation with Systems Manager Automation

For findings where the fix is a configuration change on a live resource, use Systems Manager Automation runbooks. AWS maintains a library of automation documents for common scenarios, including AWSConfigRemediation-EnableS3BucketEncryption, AWSConfigRemediation-RevokeUnusedIAMUserCredentials, AWS-EnableCloudTrail, and AWSConfigRemediation-RemoveUnrestrictedSourceIngressRules. For custom remediations, author your own SSM document and invoke it from a Step Functions state machine. Step Functions provides approval gates (using the waitForTaskToken integration with SNS) and rollback logic at much lower cost than building it inside a Lambda function.

The state machine pattern customers settle on has three states: a Choice state that determines whether approval is required (typically based on severity), an SNS-with-task-token state for the approval gate, and a Task state that calls ssm:StartAutomationExecution. Each Task state catches errors and routes to a verification Lambda function (Stage 8), so even failed remediations end with the loop continuing rather than dropping the finding.
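The three states can be sketched in Amazon States Language, built here as a Python dictionary. This is an illustration under assumptions: the input field names (severity, findingId, documentName, documentParameters), the CRITICAL threshold, and the state names are hypothetical, and a fourth Verify state is included so the error routes have a target.

```python
import json


def build_state_machine(approval_topic_arn: str, verifier_arn: str) -> dict:
    """Return an ASL definition: Choice -> optional approval -> automation -> verify."""
    # Any error is recorded on the input and still routed to verification.
    catch = [{"ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "Verify"}]
    return {
        "StartAt": "NeedsApproval",
        "States": {
            "NeedsApproval": {
                "Type": "Choice",
                "Choices": [
                    {"Variable": "$.severity", "StringEquals": "CRITICAL",
                     "Next": "RequestApproval"}
                ],
                "Default": "RunAutomation",
            },
            "RequestApproval": {
                "Type": "Task",
                # Pauses until the approver returns the task token.
                "Resource": "arn:aws:states:::sns:publish.waitForTaskToken",
                "Parameters": {
                    "TopicArn": approval_topic_arn,
                    "Message": {
                        "findingId.$": "$.findingId",
                        "taskToken.$": "$$.Task.Token",
                    },
                },
                "ResultPath": "$.approval",
                "Catch": catch,
                "Next": "RunAutomation",
            },
            "RunAutomation": {
                "Type": "Task",
                "Resource": "arn:aws:states:::aws-sdk:ssm:startAutomationExecution",
                "Parameters": {
                    "DocumentName.$": "$.documentName",
                    "Parameters.$": "$.documentParameters",
                },
                "ResultPath": "$.automation",
                "Catch": catch,
                "Next": "Verify",
            },
            "Verify": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {"FunctionName": verifier_arn, "Payload.$": "$"},
                "End": True,
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_state_machine("<topic-arn>", "<verifier-arn>"), indent=2))
```

StartAutomationExecution returns as soon as the execution is started, so the verification step still needs its own wait before checking the resource.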

Stage 6: IaC remediation with Amazon Bedrock

This is the stage that differentiates closed-loop remediation from traditional auto-remediation. Fixing the live S3 bucket's encryption setting does not help if the Terraform module that created it still produces unencrypted buckets. The durable fix lives in the source code.

A Lambda function handles this stage in four steps: locate the IaC that created the resource (via resource tags or CloudFormation stack lookup); read the relevant file from the Git repository; send the finding context and the IaC content to Bedrock with instructions to produce a minimal fix; commit the fix to a new branch and open a pull request against the customer's Git provider. The generator works across GitHub, GitLab, Bitbucket, and AWS CodeCommit by using provider-specific adapters for the commit and pull-request step. The Bedrock call itself is provider-agnostic.

Implementation discipline matters more than the prompt. A reliable IaC fix generator runs every model output through a validation pipeline before committing anything:

Parse: confirm the model returned a valid unified diff (or whatever artifact you asked for). Reject the response if it cannot be parsed.
Apply: confirm the diff applies cleanly to the file as currently checked out. Reject if hunks fail.
Size: cap the total changed lines (50 is a useful starting point). Reject diffs that exceed the cap, since unusually large diffs almost always indicate the model changed more than necessary.
File scope: confirm the diff only modifies the file the Repository Reader located. Reject any diff that touches a foreign path.
Resource scope: parse the post-application file with HCL or YAML parsing, identify every resource block changed by the diff, and reject any diff that modifies a resource other than the one targeted by the finding (with a small allowlist of permitted companion resources, for example aws_s3_bucket_public_access_block when the finding is S3.PublicAccess).

When a check fails, retry the Bedrock call with the rejection reason appended to the prompt. Cap retries at three attempts. After the third rejection, mark the finding as FIX_GENERATION_FAILED and route it to human review. The companion implementation referenced at the end of this post documents this pipeline in working code.
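A reduced version of this pipeline can be sketched as below. It covers only the parse, size, and file-scope checks plus the retry loop; the apply and resource-scope checks are omitted. The generate callable is a hypothetical wrapper around the Bedrock call that receives the previous rejection reason (or None on the first attempt) and returns the model's diff text.

```python
import re

MAX_CHANGED_LINES = 50  # starting cap suggested in the text
MAX_ATTEMPTS = 3        # retry cap from the text
HUNK = re.compile(r"^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@")


def inspect_diff(diff: str):
    """Return (paths touched, changed-line count) or raise ValueError."""
    paths, changed, old_left, new_left = set(), 0, 0, 0
    for line in diff.splitlines():
        if old_left > 0 or new_left > 0:  # inside a hunk body
            if line.startswith("\\"):     # "\ No newline at end of file"
                continue
            if line.startswith("+"):
                new_left, changed = new_left - 1, changed + 1
            elif line.startswith("-"):
                old_left, changed = old_left - 1, changed + 1
            else:                         # context line counts on both sides
                old_left, new_left = old_left - 1, new_left - 1
            continue
        match = HUNK.match(line)
        if match:
            old_left = int(match.group(1) or 1)
            new_left = int(match.group(2) or 1)
        elif line.startswith("+++ "):
            paths.add(line[4:].split("\t")[0].removeprefix("b/"))
    if not paths or changed == 0 or old_left or new_left:
        raise ValueError("response is not a well-formed unified diff")
    return paths, changed


def validate(diff: str, target_path: str):
    """Return None when the diff passes, else a reason to feed back to the model."""
    try:
        paths, changed = inspect_diff(diff)
    except ValueError as exc:
        return f"parse: {exc}"
    if changed > MAX_CHANGED_LINES:
        return f"size: {changed} changed lines exceeds cap of {MAX_CHANGED_LINES}"
    if paths != {target_path}:
        return f"file scope: diff touches {sorted(paths)}, expected only {target_path}"
    return None


def generate_fix(generate, target_path: str):
    """Call generate(reason) up to MAX_ATTEMPTS times; return (status, payload)."""
    reason = None
    for _ in range(MAX_ATTEMPTS):
        diff = generate(reason)
        reason = validate(diff, target_path)
        if reason is None:
            return "FIX_READY", diff
    return "FIX_GENERATION_FAILED", reason
```

Counting lines from the hunk headers, rather than from every line that starts with + or -, keeps file headers and removed lines that begin with dashes from skewing the size check.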

The pull-request description, populated by the Lambda function, should include the Security Hub finding ID, the severity, the resource affected, and a link back to the finding. Reviewers see context without cross-referencing.

Stage 8: Independent verification

This is the stage most remediation architectures get wrong. The mistake is assuming that if Wiz found the issue, Wiz has to confirm it is fixed. This creates a coupling problem: verification latency becomes the speed of the slowest third-party scan schedule, which is typically hours to days. The closed loop breaks if you ever change CSPM vendors.

Independent verification means the verification Lambda function queries the AWS resource directly via the AWS SDK, regardless of which tool originally surfaced the finding. It does not depend on any scan schedule. The function follows a simple pattern per resource type, illustrated for S3 buckets:

For S3.PublicAccess findings: call GetBucketAcl and GetPublicAccessBlock. Mark resolved only if no grants reach AllUsers or AuthenticatedUsers AND all four public-access-block flags are true.
For S3.MissingEncryption findings: call GetBucketEncryption. Mark resolved only if at least one rule has SSEAlgorithm of AES256 or aws:kms.
For S3.MissingVersioning findings: call GetBucketVersioning. Mark resolved only if Status == "Enabled".
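The three S3 checks above can be sketched as pure functions over the SDK responses. Each function takes the response dictionary that the corresponding S3 call returns (for example from boto3) and makes no network calls itself; the function names are illustrative.

```python
PUBLIC_GROUPS = {
    "http://acs.amazonaws.com/groups/global/AllUsers",
    "http://acs.amazonaws.com/groups/global/AuthenticatedUsers",
}
PAB_FLAGS = (
    "BlockPublicAcls",
    "IgnorePublicAcls",
    "BlockPublicPolicy",
    "RestrictPublicBuckets",
)
ACCEPTED_ALGORITHMS = {"AES256", "aws:kms"}


def public_access_resolved(acl: dict, pab: dict) -> bool:
    """acl: GetBucketAcl response; pab: GetPublicAccessBlock response."""
    grants_public = any(
        grant.get("Grantee", {}).get("URI") in PUBLIC_GROUPS
        for grant in acl.get("Grants", [])
    )
    config = pab.get("PublicAccessBlockConfiguration", {})
    return not grants_public and all(config.get(flag) is True for flag in PAB_FLAGS)


def encryption_resolved(enc: dict) -> bool:
    """enc: GetBucketEncryption response."""
    rules = enc.get("ServerSideEncryptionConfiguration", {}).get("Rules", [])
    return any(
        rule.get("ApplyServerSideEncryptionByDefault", {}).get("SSEAlgorithm")
        in ACCEPTED_ALGORITHMS
        for rule in rules
    )


def versioning_resolved(ver: dict) -> bool:
    """ver: GetBucketVersioning response; Status is absent if never enabled."""
    return ver.get("Status") == "Enabled"
```

Separating the decision logic from the SDK calls makes each check testable with recorded responses, and the same responses can be stored as verification evidence.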

Three implementation details that the prototype captured and that production deployments need:

Wait period. The verifier runs after a configurable wait (the prototype uses 30 to 1800 seconds, default 300 seconds). Customer pipelines apply Terraform changes asynchronously after a pull request merges; verification must give the apply enough time to complete.
Transient-error handling. SDK calls fail intermittently from throttling, network blips, and HTTP 5xx responses. The verifier retries each call up to twice with exponential backoff (1 second, 2 seconds) before declaring ERROR. Non-transient errors fail fast.
Evidence redaction. The verification record stores the raw SDK responses as evidence. Before logging, walk the response and replace the values of any field whose key is AccessKeyId, SecretAccessKey, SessionToken, or Authorization with the literal string [REDACTED]. AWS responses occasionally surface credential-bearing fields (especially in error metadata), and an audit log that captures them is itself a security finding.
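The retry and redaction details can be sketched as two small helpers. TransientError is a hypothetical marker exception; a real verifier would map throttling, network, and HTTP 5xx errors from the SDK onto it.

```python
import time

SENSITIVE_KEYS = {"AccessKeyId", "SecretAccessKey", "SessionToken", "Authorization"}


class TransientError(Exception):
    """Hypothetical marker for throttling, network blips, and HTTP 5xx."""


def call_with_retry(call, delays=(1, 2), sleep=time.sleep):
    """Try once, then retry after each delay; non-transient errors fail fast."""
    for delay in delays:
        try:
            return call()
        except TransientError:
            sleep(delay)
    return call()  # final attempt; any error propagates to the caller


def redact(value):
    """Return a copy with credential-bearing fields replaced at any depth."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if key in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value
```

Because redact returns a copy, the verifier can keep using the original response for its decision while logging only the redacted version.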

After the SDK check, the verifier closes the Security Hub finding using BatchUpdateFindings with Workflow.Status = RESOLVED and a note recording the verification request ID. CloudTrail captures every BatchUpdateFindings call and SDK call, producing a clean audit trail without manual work. For third-party findings from Wiz, Prisma Cloud, or Orca, optionally call the vendor's re-scan API after the SDK verification completes to keep the vendor's dashboard in sync. The authoritative close is always the independent SDK check, never the vendor's next scan window.
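The closing call can be sketched as a request builder. The note wording and the UpdatedBy value are illustrative assumptions; the finding dictionary is the ASFF record received by the verifier.

```python
def resolve_request(finding: dict, verification_id: str) -> dict:
    """Build BatchUpdateFindings parameters that mark one finding RESOLVED."""
    return {
        # A finding is identified by its Id together with its ProductArn.
        "FindingIdentifiers": [
            {"Id": finding["Id"], "ProductArn": finding["ProductArn"]}
        ],
        "Workflow": {"Status": "RESOLVED"},
        "Note": {
            "Text": f"Resolved by independent SDK check; verification request {verification_id}",
            "UpdatedBy": "verification-lambda",  # hypothetical identifier
        },
    }
```

With boto3, the result can be passed as keyword arguments to the Security Hub client's batch_update_findings call.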

This pattern provides three benefits. Verification completes in seconds rather than the hours-to-days of a vendor's next scan window. The architecture survives CSPM vendor changes without re-architecting the loop. The audit trail is automatic.

Stage 9: Prevention - where the loop closes on itself

When the same finding type appears repeatedly across the organization, stop fixing and start preventing. Move the control from reactive to proactive.

The prevention layer has three tools:

Service Control Policies (SCPs): organization-wide blocks. Use for actions that should never occur, like disabling CloudTrail or creating IAM users with AdministratorAccess.
Control Tower guardrails: account-level controls with both preventive (SCP-backed) and detective (Config-backed) variants.
Custom Checkov rules in CI: block at source, before anything deploys.

The right tool depends on what you are blocking. SCPs and guardrails operate at the API call level, so they catch any path to the bad state (Console, CLI, SDK, IaC at deploy time). Checkov operates at the source-code level, which gives developers feedback faster but only catches the IaC path. The full stack uses all three: Checkov for fast feedback at commit time, and SCPs and guardrails as a safety net at deploy time and for non-IaC paths.

A practical caution on SCP authorship: SCPs that look right often do not match the action's request context. The S3 condition key s3:x-amz-acl is checked on PutBucketAcl but not on PutBucketPolicy, so a single SCP that denies "publishing public ACLs" needs separate statements per action. Run SCPs through the IAM policy simulator and pilot them in a test organizational unit (OU) for at least two weeks before promoting to production. Track denied API calls that turned out to be legitimate operations, and adjust the SCP boundaries before they become outage triggers.
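The per-action statement structure can be sketched as follows, built as a Python dictionary and serialized to JSON. This is an illustrative policy under assumptions, not a vetted control: the Sids and the choice of canned ACL values are examples, and the first statement covers only the ACL path, so public bucket policies would need a separate control.

```python
import json


def build_scp() -> dict:
    """Return an SCP with one statement per action family."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # s3:x-amz-acl is evaluated on PutBucketAcl requests,
                # so this statement only blocks the canned-ACL path.
                "Sid": "DenyPublicBucketAcls",
                "Effect": "Deny",
                "Action": "s3:PutBucketAcl",
                "Resource": "*",
                "Condition": {
                    "StringEquals": {
                        "s3:x-amz-acl": [
                            "public-read",
                            "public-read-write",
                            "authenticated-read",
                        ]
                    }
                },
            },
            {
                # Unconditional deny for actions that should never occur.
                "Sid": "DenyCloudTrailTampering",
                "Effect": "Deny",
                "Action": ["cloudtrail:StopLogging", "cloudtrail:DeleteTrail"],
                "Resource": "*",
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_scp(), indent=2))
```

Generating the policy from code makes it easy to review in a pull request and to attach to a test OU before promotion.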

Track a single metric for this stage: the number of finding types that moved from "remediate repeatedly" to "prevented by policy." This is the metric that proves the loop is getting smarter over time.

Scale-agnostic implementation

The pattern works at any company size. What changes with scale is implementation sophistication, not the shape of the loop:

Stage	Startup (10 engineers)	Mid-market (200 engineers)	Enterprise (10,000 engineers)
Aggregate	One dashboard	Security Hub + automation rules	Multi-account delegated admin, cross-Region
Route	Manual review	EventBridge rules	EventBridge + account factory integration
Triage	One person	Simple runbook	DynamoDB ownership + Step Functions workflow
Remediate (runtime)	Manual or SSM docs	SSM + Step Functions	Full orchestration across hundreds of accounts
Remediate (IaC)	PR comments by hand	Bedrock PR generation for one repo	Bedrock + context + multi-repo + few-shot
Prevent	Code review	Checkov in CI	SCPs + Control Tower + custom Checkov library

A startup can implement this pattern in about two weeks with Security Hub, one EventBridge rule, and one SSM runbook. An enterprise with 500 accounts needs the full architecture, delegated admin patterns, and organizational rollout discipline. Both are implementing the same pattern.

Conclusion

Cloud security programs succeed or fail at remediation. Detection is a solved problem; the gap between detecting a finding and durably fixing it is where risk accumulates. The closed-loop architecture in this post uses AWS Security Hub for aggregation, EventBridge for routing, Lambda and DynamoDB for triage, SSM Automation for runtime fixes, Amazon Bedrock for durable IaC fixes, and SCPs plus Checkov for prevention. Independent verification via the AWS SDK closes the loop regardless of which tool surfaced the original finding.

The pattern is vendor-agnostic and scale-agnostic. A startup can implement the minimum viable version in two weeks. An enterprise uses the same pattern with more sophistication around multi-account rollout and feedback-to-prevention tracking. Both are closing the loop from finding to fix to prevention, with audit evidence at every step.

To get started, pick one of three entry points based on your current state:

If you have no closed loop today, start with Security Hub aggregation, one EventBridge rule, and one SSM Automation runbook for your highest-volume finding type. This is two weeks of work.
If you have runtime remediation but nothing durable, add the Bedrock-powered IaC fix stage for one repository and one finding type. This is two to four weeks of work, plus the validator pipeline.
If you have the full loop but findings still recur, build the feedback-to-prevention pipeline. Track finding types you remediate more than twice per quarter and convert them to SCPs or Checkov rules.

The single metric that distinguishes a closed loop from an open one is the prevention ratio: the proportion of finding types you have moved from reactive remediation to preventive policy. Track that ratio, watch it grow, and the architecture in this post is doing the work it was designed to do.
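The ratio and the conversion candidates can be computed from a simple remediation log. This sketch assumes the log is a list of finding-type strings, one entry per remediation in the quarter; the threshold of two follows the "more than twice per quarter" guidance above, and the function names are illustrative.

```python
from collections import Counter


def prevention_candidates(remediated_types, threshold: int = 2):
    """Finding types remediated more than `threshold` times in the period."""
    counts = Counter(remediated_types)
    return sorted(t for t, n in counts.items() if n > threshold)


def prevention_ratio(prevented_types: set, all_types: set) -> float:
    """Share of known finding types now blocked by a preventive policy."""
    if not all_types:
        return 0.0
    return len(prevented_types & all_types) / len(all_types)
```

For example, a quarter with three S3.PublicAccess remediations and one S3.MissingEncryption remediation yields S3.PublicAccess as the only candidate for conversion to an SCP or Checkov rule.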

For deeper dives on individual components, see:
