Model Incident Response & Recovery

AI incidents are production incidents where model behavior causes or materially contributes to harm, policy breach, business loss, or unacceptable user impact.

Severity Model

Severity	Example	Target Response
SEV-1	High-impact harmful output at scale, regulatory risk	Immediate containment
SEV-2	Significant user-facing degradation or repeated unsafe outputs	Contain within same business day
SEV-3	Localized quality failure with bounded impact	Fix in normal incident cycle

Immediate Response

Stop or reduce blast radius (rollback, disable endpoint, reduce audience).
Preserve evidence (request IDs, traces, model version, prompt/runtime config).
Notify required stakeholders (engineering, product, security, legal if needed).
Publish initial customer/internal status update.

Root Cause Dimensions

Classify cause across these dimensions:

Data quality or distribution shift
Prompt/policy regression
Model version regression
Guardrail failure
Evaluation gap (issue should have been caught pre-release)

Corrective Action Requirements

Every SEV-1/SEV-2 incident MUST produce:

Timeline with decision points
Root cause statement
Verified corrective action
Preventive control update (evaluation, monitoring, guardrail, or process)

Recovery Gate

Before restoring full traffic:

Fix validated on representative incident scenarios
Monitoring thresholds tightened for 7 days
Product + engineering + governance sign-off recorded

Severity Model​

Immediate Response​

Root Cause Dimensions​

Corrective Action Requirements​

Recovery Gate​

Severity Model

Immediate Response

Root Cause Dimensions

Corrective Action Requirements

Recovery Gate