AWS Incident Response & Auto-Remediation for the DevOps Engineer Professional (DOP-C02)

A lot of candidates over-index on CI/CD when studying for the AWS Certified DevOps Engineer – Professional (DOP-C02) exam and under-prepare for the domain that actually separates a senior DevOps engineer from a pipeline operator: Incident and Event Response. It’s worth roughly 14% of the exam, and the questions are scenario-heavy. They don’t ask “what does EventBridge do?” — they ask “an event fires, what’s the most operationally efficient, least-privilege way to detect it, notify the right people, and automatically fix it?”

This guide walks the whole domain from a practitioner’s perspective. You’ll get the event-driven architecture AWS expects you to reach for, the exact services and how they connect, and copy-pasteable CLI and JSON so the patterns become muscle memory before exam day.

If you still need the big picture first, start with the AWS DevOps Engineer Professional Exam Guide 2026 and the AWS DevOps Engineer study plan, then come back here to go deep on incident response.

What “Incident and Event Response” Covers

In the DOP-C02 blueprint this domain has three tasks:

Task	What it means in practice
Manage event sources	Route events from EventBridge, CloudWatch, Health, and Config to the right targets
Implement automated remediation	Fix the problem with Lambda, SSM Automation, or Config remediation
Troubleshoot failures	Use logs, metrics, and traces to find root cause

The mental model the exam rewards is event-driven automation: something changes → an event is emitted → a rule routes it → a target reacts (notify and/or remediate). Almost every correct answer in this domain is some flavor of that pattern. Let’s build it up service by service.

Event Sources: EventBridge Is the Hub

Amazon EventBridge (the evolution of CloudWatch Events) is the backbone of event-driven response on AWS. It receives events on an event bus, matches them against rules using an event pattern, and forwards matches to one or more targets (Lambda, SNS, SQS, Step Functions, SSM Automation, and many more).

There are two ways a rule triggers:

Event pattern — react to something that happened (an EC2 state change, a GuardDuty finding, an AWS Config compliance change).
Schedule — run on a cron or rate expression (a nightly compliance sweep, an hourly health check).

Here’s an event pattern that matches an EC2 instance entering a stopped or terminated state:

{
  "source": ["aws.ec2"],
  "detail-type": ["EC2 Instance State-change Notification"],
  "detail": {
    "state": ["stopped", "terminated"]
  }
}
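To build intuition for how that pattern is evaluated, here is a minimal local sketch in Python. It covers only exact-value matching (each leaf is a list of allowed values, nested objects recurse) and ignores prefix, numeric, and array-valued matching. The `matches` and `sample_event` helpers and the instance ID are illustrative assumptions, not part of any AWS SDK.

```python
import json


def matches(pattern: dict, event: dict) -> bool:
    """Return True if `event` satisfies `pattern` (exact-value matching only)."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the corresponding event object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf: a list of allowed values, any one of which may match.
            return False
    return True


PATTERN = json.loads("""
{
  "source": ["aws.ec2"],
  "detail-type": ["EC2 Instance State-change Notification"],
  "detail": {"state": ["stopped", "terminated"]}
}
""")


def sample_event(state: str) -> dict:
    """Build a trimmed-down EC2 state-change event (hypothetical instance ID)."""
    return {
        "source": "aws.ec2",
        "detail-type": "EC2 Instance State-change Notification",
        "detail": {"instance-id": "i-0123456789abcdef0", "state": state},
    }


if __name__ == "__main__":
    print(matches(PATTERN, sample_event("stopped")))  # True
    print(matches(PATTERN, sample_event("running")))  # False
```

For an authoritative check against the real matching engine, `aws events test-event-pattern` evaluates a pattern against a full sample event.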

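Create the rule and attach an SNS topic so on-call gets paged. The topic’s access policy must allow the events.amazonaws.com service principal to publish, otherwise the rule matches but no notification is delivered: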

aws events put-rule \
  --name ec2-stopped-alert \
  --event-pattern file://ec2-pattern.json

aws events put-targets \
  --rule ec2-stopped-alert \
  --targets "Id"="1","Arn"="arn:aws:sns:us-east-1:111122223333:ops-alerts"

Two exam-critical EventBridge details:

A single rule can have up to 5 targets, and you can fan out to notification and remediation simultaneously.
Many AWS services (GuardDuty, Security Hub, AWS Health, Config) emit events only to the default event bus — know that custom buses are for your own and partner events.
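The fan-out and the per-rule limit can be sketched as a small Python helper that builds the `Targets` list for `put_targets` and refuses to exceed five entries. The `build_targets` name, the generated target IDs, and the Lambda function ARN are assumptions made for illustration.

```python
MAX_TARGETS_PER_RULE = 5  # the per-rule limit stated above


def build_targets(arns: list[str]) -> list[dict]:
    """Turn a list of target ARNs into the Targets structure put_targets expects."""
    if not arns:
        raise ValueError("at least one target is required")
    if len(arns) > MAX_TARGETS_PER_RULE:
        raise ValueError(
            f"{len(arns)} targets requested, but a rule allows at most "
            f"{MAX_TARGETS_PER_RULE}; fan out further through SNS instead"
        )
    # Each target needs an Id that is unique within the rule.
    return [{"Id": f"target-{i}", "Arn": arn} for i, arn in enumerate(arns, start=1)]


if __name__ == "__main__":
    targets = build_targets([
        "arn:aws:sns:us-east-1:111122223333:ops-alerts",                # notify
        "arn:aws:lambda:us-east-1:111122223333:function:remediate-ec2",  # hypothetical remediation function
    ])
    for target in targets:
        print(target)
```

With credentials configured, the result could be passed as `boto3.client("events").put_targets(Rule="ec2-stopped-alert", Targets=targets)`.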

Notification: CloudWatch Alarms and SNS

Not every signal is a discrete event; many are metrics crossing a threshold. That’s CloudWatch Alarms.

# Alarm when an ASG's average CPU exceeds 80% for 2 consecutive 1-minute periods
# (1-minute datapoints require detailed monitoring on the instances)
aws cloudwatch put-metric-alarm \
  --alarm-name high-cpu-web-asg \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=AutoScalingGroupName,Value=web-asg \
  --statistic Average --period 60 --evaluation-periods 2 \
  --threshold 80 --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:111122223333:ops-alerts

Know these distinctions cold, because the exam tests them directly:

Concept	Use it when
Metric alarm	A single metric crosses a static threshold
Composite alarm	Combine multiple alarms with AND/OR to cut alert noise
Anomaly detection	The “normal” range varies (diurnal traffic) and a static threshold is wrong
Metric filter	Turn a pattern in CloudWatch Logs into a metric you can alarm on

A frequent scenario: “too many alarm emails.” The intended answer is usually a composite alarm (only page when CPU and latency are both bad) rather than tuning each metric alarm in isolation.
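As a rough sketch of that composite-alarm answer, the Python below builds the rule expression and the keyword arguments for `put_composite_alarm`. The child alarm `high-latency-web-alb`, the composite name, and both helper functions are hypothetical; only `high-cpu-web-asg` and the SNS topic come from the example above.

```python
def composite_rule(alarm_names: list[str], operator: str = "AND") -> str:
    """Join child alarms into a composite AlarmRule expression."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be AND or OR")
    if not alarm_names:
        raise ValueError("at least one child alarm is required")
    # ALARM("name") is true while the named child alarm is in the ALARM state.
    return f" {operator} ".join(f'ALARM("{name}")' for name in alarm_names)


def composite_alarm_kwargs(name: str, children: list[str], sns_topic_arn: str) -> dict:
    """Build keyword arguments for the CloudWatch put_composite_alarm call."""
    return {
        "AlarmName": name,
        "AlarmRule": composite_rule(children),
        "AlarmActions": [sns_topic_arn],
    }


if __name__ == "__main__":
    kwargs = composite_alarm_kwargs(
        "web-tier-degraded",                           # hypothetical composite alarm name
        ["high-cpu-web-asg", "high-latency-web-alb"],  # second child alarm is hypothetical
        "arn:aws:sns:us-east-1:111122223333:ops-alerts",
    )
    print(kwargs["AlarmRule"])
```

The child alarms would normally have their own SNS actions removed, so that only the composite pages on-call.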

SNS is the notification fan-out: one topic → email, SMS, a Slack or Microsoft Teams channel via AWS Chatbot (since renamed Amazon Q Developer in chat applications), or a Lambda. EventBridge and CloudWatch Alarms both target SNS, which is why it shows up in almost every answer.

Auto-Remediation: the Core Skill

Notification is table stakes. The professional-level skill — and the most heavily tested — is automatically fixing the problem. There are three canonical remediation engines, and choosing the right one is usually the crux of the question.

1. AWS Config remediation (for compliance drift)

AWS Config continuously evaluates resource configuration against rules. When a resource goes non-compliant, you can attach a remediation action that runs an SSM Automation document — no Lambda required.

Classic example: an S3 bucket becomes publicly readable. A Config rule (s3-bucket-public-read-prohibited) flags it, and the attached remediation (AWS-DisableS3BucketPublicReadWrite) fixes it automatically.

aws configservice put-remediation-configurations \
  --remediation-configurations '[{
    "ConfigRuleName": "s3-bucket-public-read-prohibited",
    "TargetType": "SSM_DOCUMENT",
    "TargetId": "AWS-DisableS3BucketPublicReadWrite",
    "Automatic": true,
    "MaximumAutomaticAttempts": 3,
    "RetryAttemptSeconds": 60,
    "Parameters": {
      "AutomationAssumeRole": {"StaticValue": {"Values": ["arn:aws:iam::111122223333:role/ConfigRemediationRole"]}},
      "S3BucketName": {"ResourceValue": {"Value": "RESOURCE_ID"}}
    }
  }]'

Reach for Config remediation when the trigger is “a resource is misconfigured/non-compliant.” It’s the least-effort, most-managed option, and Automatic: true makes it hands-off.

2. Systems Manager Automation runbooks (for operational fixes)

AWS Systems Manager Automation documents (runbooks) are reusable, multi-step operational procedures. AWS ships hundreds (AWS-RestartEC2Instance, AWS-StopEC2Instance, AWSSupport-* diagnostics), and you can author your own in YAML/JSON.

You can trigger a runbook directly from an EventBridge rule — no Lambda glue:

aws events put-targets --rule unhealthy-instance \
  --targets '[{
    "Id": "remediate",
    "Arn": "arn:aws:ssm:us-east-1::automation-definition/AWS-RestartEC2Instance",
    "RoleArn": "arn:aws:iam::111122223333:role/EventBridgeSSMRole",
    "InputTransformer": {
      "InputPathsMap": {"instance": "$.detail.instance-id"},
      "InputTemplate": "{\"InstanceId\": [<instance>]}"
    }
  }]'

A minimal custom runbook that restarts an instance:

schemaVersion: '0.3'
description: Restart an EC2 instance
assumeRole: '{{ AutomationAssumeRole }}'
parameters:
  InstanceId: { type: String }
  AutomationAssumeRole: { type: String }
mainSteps:
  - name: stopInstance
    action: aws:changeInstanceState
    inputs:
      InstanceIds: ['{{ InstanceId }}']
      DesiredState: stopped
  - name: startInstance
    action: aws:changeInstanceState
    inputs:
      InstanceIds: ['{{ InstanceId }}']
      DesiredState: running

Reach for SSM Automation when the fix is an operational procedure (restart, patch, snapshot, isolate) — especially one AWS already provides. The InputTransformer above is a favorite exam detail: it reshapes the event JSON into the parameters the runbook expects.
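To make the InputTransformer behavior concrete, the following Python sketch emulates it locally for the simple case used above: dotted paths with no array indices, and unquoted placeholders replaced by the JSON-encoded value. It is a simplified approximation of the service, and `resolve_path` and `apply_input_transformer` are hypothetical helper names.

```python
import json
import re


def resolve_path(event: dict, path: str):
    """Resolve a simple dotted JSONPath such as '$.detail.instance-id'."""
    if not path.startswith("$."):
        raise ValueError(f"unsupported path: {path}")
    value = event
    for key in path[2:].split("."):
        value = value[key]
    return value


def apply_input_transformer(event: dict, input_paths_map: dict, input_template: str) -> str:
    """Substitute each <name> placeholder with the JSON-encoded value from the event."""
    values = {name: resolve_path(event, path) for name, path in input_paths_map.items()}
    return re.sub(
        r"<([A-Za-z0-9_-]+)>",
        lambda match: json.dumps(values[match.group(1)]),
        input_template,
    )


if __name__ == "__main__":
    event = {"detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"}}
    runbook_input = apply_input_transformer(
        event,
        {"instance": "$.detail.instance-id"},
        '{"InstanceId": [<instance>]}',
    )
    print(runbook_input)  # {"InstanceId": ["i-0123456789abcdef0"]}
```

The output is the parameter document the runbook receives: the instance ID pulled out of the event and wrapped in the list that the InstanceId parameter expects.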

3. Lambda (for custom logic)

When remediation needs custom business logic that no managed runbook covers, an EventBridge rule targets a Lambda. The Lambda receives the event, makes a decision, and calls the AWS SDK to fix things.

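import os

import boto3

# Create the client once per execution environment, not on every invocation.
ec2 = boto3.client("ec2")

# ID (sg-...) of a pre-created isolation security group.
# QUARANTINE_SG_ID is a hypothetical environment variable name.
QUARANTINE_SG_ID = os.environ["QUARANTINE_SG_ID"]

def handler(event, context):
    # Quarantine an EC2 instance flagged by a GuardDuty finding
    instance_id = event["detail"]["resource"]["instanceDetails"]["instanceId"]
    # Groups takes security group IDs (not names) and replaces the current groups
    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        Groups=[QUARANTINE_SG_ID],
    )
    return {"status": "isolated", "instance": instance_id}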

Reach for Lambda only when the logic is genuinely custom. If a Config remediation or an SSM runbook can do it, the exam considers those the “more operationally efficient” answer because there’s less code to own.

Choosing the right engine

Trigger	Best remediation engine
Resource non-compliant (S3 public, SG open, unencrypted volume)	AWS Config remediation → SSM document
Operational fix (restart, patch, snapshot, reboot)	SSM Automation runbook from EventBridge
GuardDuty/Security Hub finding needing custom logic	Lambda (or Security Hub custom action → EventBridge)
Scheduled compliance sweep	EventBridge schedule → SSM Automation

Putting It Together: a Reference Pattern

A complete, exam-canonical incident-response flow for a security finding looks like this:

GuardDuty detects a compromised instance and emits a finding.
The finding lands on the default event bus; an EventBridge rule matches it.
The rule fans out to two targets: an SNS topic (notify on-call) and an SSM Automation runbook or Lambda (isolate the instance).
Remediation actions and approvals can be tracked in Systems Manager Incident Manager / OpsCenter.
Everything is logged to CloudWatch Logs and CloudTrail for the post-incident review.

That single picture — source → bus → rule → notify + remediate → record — answers a surprising share of this domain’s questions. The architecture matches the operational-excellence thinking in the broader AWS DevOps Engineer monitoring guide, and it complements the deployment automation covered in the AWS DevOps CI/CD guide.
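The rule in step two of that flow might look like the sketch below, which builds the arguments for `put_rule` in Python. The severity threshold of 7, the rule name, and the `guardduty_rule_kwargs` helper are assumptions for illustration; tune the threshold to whatever your team treats as page-worthy.

```python
import json


def guardduty_rule_kwargs(name: str, min_severity: float = 7.0) -> dict:
    """Build put_rule arguments that match GuardDuty findings on EC2 instances."""
    pattern = {
        "source": ["aws.guardduty"],
        "detail-type": ["GuardDuty Finding"],
        "detail": {
            # Numeric matching: only findings at or above the threshold.
            "severity": [{"numeric": [">=", min_severity]}],
            "resource": {"resourceType": ["Instance"]},
        },
    }
    # No EventBusName is given, so the rule lands on the default bus,
    # which is where service-generated findings arrive.
    return {"Name": name, "EventPattern": json.dumps(pattern), "State": "ENABLED"}


if __name__ == "__main__":
    kwargs = guardduty_rule_kwargs("guardduty-high-severity-instance")  # hypothetical rule name
    print(json.dumps(json.loads(kwargs["EventPattern"]), indent=2))
```

With credentials configured, `boto3.client("events").put_rule(**kwargs)` would create the rule, after which the SNS and remediation targets are attached with `put_targets`.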

Troubleshooting and Root Cause

The third task is finding why something broke. The tools the exam expects:

CloudWatch Logs Insights — query structured logs fast:

fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 50

CloudWatch Logs subscription filters — stream matching log lines to Lambda/Kinesis in real time for immediate reaction.
AWS X-Ray — distributed tracing to find the slow or failing hop in a microservice call chain.
CloudTrail — the audit trail of who made which API call, essential for “what changed right before the incident?”
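Logs Insights queries run asynchronously, so scripting one means starting the query and polling for results. A minimal Python sketch follows; it takes the CloudWatch Logs client as a parameter so it can be exercised with a stub, and the function name, polling interval, and retry count are arbitrary assumptions.

```python
import time

QUERY = """fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 50"""


def run_insights_query(logs_client, log_group, start_ts, end_ts,
                       query=QUERY, poll_seconds=1.0, max_polls=30):
    """Start a Logs Insights query and poll until it completes.

    start_ts and end_ts are epoch seconds. Returns one dict per result row.
    """
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=start_ts,
        endTime=end_ts,
        queryString=query,
    )["queryId"]

    for _ in range(max_polls):
        response = logs_client.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            # Each row is a list of {"field": ..., "value": ...} cells.
            return [{cell["field"]: cell["value"] for cell in row}
                    for row in response["results"]]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query {query_id} ended with status {status}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"query {query_id} did not finish after {max_polls} polls")
```

In a real account the client would be `boto3.client("logs")` and the log group would be your application's, for example a Lambda function's `/aws/lambda/...` group.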

A good habit: when a deployment causes an incident, correlate the CloudTrail event timeline with the CloudWatch metric anomaly to pinpoint the change that triggered it.
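That correlation can be sketched as a small filter over CloudTrail events, shown here in Python. It assumes the events are dicts shaped like the entries returned by CloudTrail's `lookup_events` (with `EventTime` as a timezone-aware datetime), and the 15-minute window and the `changes_before` name are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta, timezone


def changes_before(events, anomaly_time, window=timedelta(minutes=15)):
    """Return events inside the window leading up to the anomaly, newest first."""
    start = anomaly_time - window
    hits = [e for e in events if start <= e["EventTime"] <= anomaly_time]
    return sorted(hits, key=lambda e: e["EventTime"], reverse=True)


if __name__ == "__main__":
    anomaly = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)  # made-up timestamp
    sample = [
        {"EventName": "UpdateFunctionCode", "EventTime": anomaly - timedelta(minutes=3)},
        {"EventName": "CreateDeployment", "EventTime": anomaly - timedelta(minutes=10)},
        {"EventName": "PutBucketPolicy", "EventTime": anomaly - timedelta(hours=2)},
    ]
    for event in changes_before(sample, anomaly):
        print(event["EventTime"].isoformat(), event["EventName"])
```

The newest change inside the window is usually the first suspect, which is why the results are sorted in descending time order.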

A 7-Day Domain 5 Study Plan

Day	Focus	Hands-on goal
1	EventBridge rules & patterns	Write 3 event patterns, attach SNS + a second target
2	CloudWatch alarms	Build metric, composite, and anomaly-detection alarms
3	AWS Config remediation	Auto-fix a public S3 bucket end-to-end
4	SSM Automation	Trigger AWS-RestartEC2Instance from EventBridge with an InputTransformer
5	Lambda remediation	Isolate a GuardDuty-flagged instance with a custom function
6	Troubleshooting	Logs Insights query + X-Ray trace on a broken app
7	Mixed scenarios	Time yourself answering “detect → notify → remediate” questions

The pattern that separates a pass from a fail here is recognizing which remediation engine the scenario wants — and that only comes from building each one at least once.

Practice in a Real Exam-Like Environment

DOP-C02 incident-response questions are scenario-based and time-pressured. The fastest way to internalize “which service, which target, which remediation” is to drill realistic questions with explanations until the patterns are automatic.

Sailor.sh’s AWS Certified DevOps Engineer – Professional (DOP-C02) Mock Exam Bundle includes hundreds of exam-style questions with detailed explanations across all six domains — including event-driven incident response and auto-remediation. Each answer explains why the right remediation engine wins, so you build the decision-making the real exam tests. To round out your prep, pair it with the AWS DevOps Engineer practice questions and the free resources guide.

Make sure you’ve also cleared the AWS DevOps Engineer prerequisites before sitting this professional-level exam.

Frequently Asked Questions

How much of the DOP-C02 exam is incident and event response?

Incident and Event Response is roughly 14% of the AWS Certified DevOps Engineer – Professional exam. It’s deeply intertwined with the Monitoring and Logging domain, so studying them together pays off.

When should I use AWS Config remediation versus a Lambda?

Use Config remediation when the trigger is a resource being non-compliant or misconfigured (public S3 bucket, open security group, unencrypted volume) — it runs a managed SSM document with no code to maintain. Use Lambda only when remediation needs custom logic that no managed runbook or Config rule covers. The exam treats the lower-code, managed option as “more operationally efficient.”

What’s the difference between EventBridge and CloudWatch Alarms?

EventBridge reacts to discrete events (a state change, a finding, a schedule) by matching an event pattern. CloudWatch Alarms react to metrics crossing a threshold over time. Many designs use both: an alarm fires and notifies an SNS topic, while an EventBridge rule matching the CloudWatch Alarm State Change event kicks off remediation.

Can EventBridge trigger Systems Manager Automation directly?

Yes — SSM Automation is a native EventBridge target, so you can run a runbook with no Lambda glue. Use an InputTransformer to reshape the event JSON into the runbook’s parameters (for example, extracting the instance ID).

Why do GuardDuty and Config events only reach the default event bus?

AWS service-generated events are delivered to the default event bus in each account/region. Custom event buses are for your own application events and partner integrations. On the exam, if a question routes a GuardDuty finding, the rule belongs on the default bus.

What’s the single highest-impact pattern to memorize?

Source → EventBridge rule → fan out to SNS (notify) and SSM Automation/Lambda (remediate). That one architecture answers a large share of this domain’s questions; internalize it and adapt the remediation engine to the scenario.

Conclusion

Domain 5 rewards engineers who think in events, not scripts. Once you see every incident as “something emits an event → a rule routes it → a target notifies and remediates,” the service choices fall into place: EventBridge to route, CloudWatch and SNS to detect and notify, and Config remediation, SSM Automation, or Lambda to fix, in that order of preference.

Build each remediation engine at least once, learn to spot which one a scenario is asking for, and the incident-response questions stop being guesswork. The same event-driven reflexes that earn the certification are exactly what keep real production systems self-healing — which is the whole point of the professional-level exam.