Risk Metrics

Risk metrics monitor the safety, quality, and security implications of AI-assisted development. They are the essential counterbalance to Productivity Metrics: without them, an organization cannot determine whether AI-driven speed gains come at the expense of code quality, security posture, or system reliability. Research shows that AI co-authored code carries 1.7x more issues and a 2.74x higher vulnerability rate than human-only code, so risk measurement is not optional; it is a governance imperative.

Non-Negotiable Measurement

Risk metrics MUST be implemented before or concurrently with productivity metrics. An organization that measures productivity without measuring risk is optimizing for speed while blind to the hazards that speed creates. This pattern is the single most common failure mode in AI-assisted development adoption.

KPI-R1: AI-Related Incident Rate

Definition

AI-Related Incident Rate measures the number of production incidents that are attributed — fully or partially — to AI-generated or AI-assisted code, per quarter. This metric is the ultimate lagging indicator of AI code quality: it measures real-world impact on users and systems.

Measurement Method

  • Counting rule: An incident is attributed to AI-generated code when the root cause analysis (RCA) identifies AI-assisted code as a contributing factor. Partial attribution counts as 0.5 incidents.
  • Severity classification: Incidents are classified using standard severity levels (SEV1-SEV4) to enable severity-weighted analysis
  • Unit: Incidents per quarter, with severity-weighted alternative
  • Aggregation: Quarterly, organization-wide and per product/service
  • Attribution method: Code provenance tracking links production code to its AI-assisted origin. Without provenance tracking, attribution relies on RCA investigation, which is less reliable.

Severity-Weighted Scoring

| Severity | Description | Weight | Example |
| --- | --- | --- | --- |
| SEV1 | Complete service outage or data breach affecting customers | 10 | AI-generated authentication bypass deployed to production |
| SEV2 | Major feature degradation or security vulnerability actively exploited | 5 | AI-generated SQL query causes data corruption under load |
| SEV3 | Minor feature issue or potential security exposure | 2 | AI-generated API endpoint returns incorrect error codes |
| SEV4 | Cosmetic issue or non-impacting defect | 1 | AI-generated UI component renders incorrectly in edge case |

The Severity-Weighted Incident Score is calculated as the sum of (incident count × severity weight) over all incidents in the quarter.
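
As a minimal sketch, the quarterly arithmetic combines the 0.5 partial-attribution counting rule with the severity weights above (the Incident record and its fields are illustrative, not a prescribed schema):

```python
from dataclasses import dataclass

# Severity weights from the table above.
SEVERITY_WEIGHTS = {"SEV1": 10, "SEV2": 5, "SEV3": 2, "SEV4": 1}

# Counting rule: full AI attribution counts as 1.0 incident, partial as 0.5.
ATTRIBUTION_FACTOR = {"full": 1.0, "partial": 0.5}

@dataclass
class Incident:
    severity: str     # "SEV1" through "SEV4", assigned during the RCA
    attribution: str  # "full" or "partial" AI attribution from the RCA

def quarterly_scores(incidents: list[Incident]) -> tuple[float, float]:
    """Return (raw incident count, severity-weighted score) for one quarter."""
    raw = sum(ATTRIBUTION_FACTOR[i.attribution] for i in incidents)
    weighted = sum(
        ATTRIBUTION_FACTOR[i.attribution] * SEVERITY_WEIGHTS[i.severity]
        for i in incidents
    )
    return raw, weighted

# One fully attributed SEV2 plus one partially attributed SEV3:
# raw count = 1.0 + 0.5 = 1.5; weighted = 5 * 1.0 + 2 * 0.5 = 6.0.
print(quarterly_scores([Incident("SEV2", "full"), Incident("SEV3", "partial")]))
```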

Targets by Maturity Level

| Maturity Level | Raw Count Target | Severity-Weighted Target | Notes |
| --- | --- | --- | --- |
| Level 2 | Measurement begins | N/A | Establishing attribution capability |
| Level 3 | < 5 per quarter | < 20 weighted points | Governance gates should prevent most issues |
| Level 4 | < 2 per quarter | < 8 weighted points | Automated scanning catches remaining issues |
| Level 5 | < 1 per quarter | < 3 weighted points | Predictive risk management prevents issues proactively |

Zero Is Suspicious

An organization reporting zero AI-related incidents is likely under-attributing rather than operating flawlessly. This typically indicates inadequate code provenance tracking or RCA processes that do not investigate AI contribution. Organizations SHOULD validate zero-incident reports with provenance coverage audits.

KPI-R2: Security Findings Rate

Definition

Security Findings Rate measures the density of security vulnerabilities identified in AI-assisted code, expressed as findings per 1,000 lines of AI-assisted code (KLOC). This metric is compared against the equivalent rate for human-only code to produce a vulnerability ratio that directly measures whether AI tools are increasing or decreasing the organization's security risk.

Measurement Method

  • Numerator: Total security findings (SAST, DAST, SCA, manual review) in code tagged as AI-assisted
  • Denominator: Thousands of lines of AI-assisted code (KLOC) in the same period
  • Vulnerability Ratio: AI findings rate divided by human-only findings rate. A ratio of 1.0 means parity; above 1.0 means AI code is more vulnerable (see the sketch after this list).
  • Unit: Findings per KLOC (absolute); ratio (relative)
  • Aggregation: Monthly, segmented by finding severity (Critical, High, Medium, Low)
  • Tool requirements: SAST/DAST tools MUST include AI-specific scanning rules (see Level 3 requirements)
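
Putting the numerator, denominator, and ratio together, a minimal sketch of the computation (function and field names are illustrative):

```python
def findings_per_kloc(findings: int, loc: int) -> float:
    """Security findings per 1,000 lines of code."""
    return findings / (loc / 1000)

def vulnerability_ratio(ai_findings: int, ai_loc: int,
                        human_findings: int, human_loc: int) -> float:
    """AI findings rate divided by the human-only baseline rate.

    1.0 means parity; above 1.0 means AI-assisted code is more vulnerable.
    """
    return (findings_per_kloc(ai_findings, ai_loc)
            / findings_per_kloc(human_findings, human_loc))

# 12 findings in 40 KLOC of AI-assisted code vs. 15 findings in 100 KLOC of
# human-only code: 0.30 vs. 0.15 findings per KLOC, i.e. a 2.0x ratio.
print(vulnerability_ratio(12, 40_000, 15, 100_000))
```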

Severity-Based Findings Targets

| Finding Severity | Level 3 Target (ratio) | Level 4 Target (ratio) | Level 5 Target (ratio) | Remediation SLA |
| --- | --- | --- | --- | --- |
| Critical | <= 2.0x baseline | <= 1.5x baseline | <= 1.0x baseline | 24 hours |
| High | <= 2.5x baseline | <= 1.5x baseline | <= 1.0x baseline | 7 days |
| Medium | <= 3.0x baseline | <= 2.0x baseline | <= 1.0x baseline | 30 days |
| Low | No target | <= 2.5x baseline | <= 1.5x baseline | 90 days |
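
The Remediation SLA column translates directly into deadlines. A sketch, assuming each finding carries a detection timestamp (function names are illustrative):

```python
from datetime import datetime, timedelta

# Remediation SLAs from the table above.
REMEDIATION_SLA = {
    "Critical": timedelta(hours=24),
    "High": timedelta(days=7),
    "Medium": timedelta(days=30),
    "Low": timedelta(days=90),
}

def remediation_deadline(severity: str, detected_at: datetime) -> datetime:
    """Deadline by which a finding of this severity must be remediated."""
    return detected_at + REMEDIATION_SLA[severity]

def is_sla_breached(severity: str, detected_at: datetime, now: datetime) -> bool:
    """True if an open finding is past its remediation deadline."""
    return now > remediation_deadline(severity, detected_at)

# A Critical finding detected yesterday morning is already past its 24-hour SLA.
now = datetime(2025, 6, 2, 12, 0)
print(is_sla_breached("Critical", datetime(2025, 6, 1, 9, 0), now))  # True
```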

Common AI-Specific Vulnerability Patterns

Organizations SHOULD configure scanning tools to prioritize these AI-generated vulnerability patterns:

| Vulnerability Pattern | Description | CWE Reference |
| --- | --- | --- |
| Hardcoded credentials | AI suggests placeholder secrets that reach production | CWE-798 |
| Insecure default configurations | AI generates permissive configurations (e.g., CORS allow-all) | CWE-1188 |
| SQL injection susceptibility | AI-generated database queries lack parameterization | CWE-89 |
| Path traversal | AI-generated file handling does not sanitize paths | CWE-22 |
| Missing input validation | AI-generated endpoints accept unvalidated input | CWE-20 |
| Insecure deserialization | AI suggests deserialization without type checking | CWE-502 |
| Dependency confusion | AI suggests packages with names similar to internal packages | CWE-427 |
| Outdated cryptographic algorithms | AI suggests deprecated algorithms from training data | CWE-327 |
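
One way to act on this list is to boost triage priority for findings in AI-assisted code whose CWE appears on the watchlist. A minimal sketch, where the finding structure and the one-band promotion convention are illustrative assumptions:

```python
# CWE identifiers from the pattern table above.
AI_CWE_WATCHLIST = {
    "CWE-798", "CWE-1188", "CWE-89", "CWE-22",
    "CWE-20", "CWE-502", "CWE-427", "CWE-327",
}

SEVERITY_RANK = {"Critical": 0, "High": 1, "Medium": 2, "Low": 3}

def triage_priority(finding: dict) -> int:
    """Lower value = triaged sooner. AI-assisted code matching a watchlist
    CWE is promoted one severity band."""
    rank = SEVERITY_RANK[finding["severity"]]
    if finding.get("ai_assisted") and finding.get("cwe") in AI_CWE_WATCHLIST:
        rank = max(0, rank - 1)
    return rank

findings = [
    {"id": "F-1", "severity": "Medium", "cwe": "CWE-798", "ai_assisted": True},
    {"id": "F-2", "severity": "Medium", "cwe": "CWE-400", "ai_assisted": True},
]
# F-1 (hardcoded credentials, on the watchlist) triages ahead of F-2.
print([f["id"] for f in sorted(findings, key=triage_priority)])
```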

KPI-R3: Rework Percentage

Definition

Rework Percentage measures the proportion of AI-assisted code that requires revision within 30 days of being merged. Rework includes bug fixes, security patches, performance corrections, and refactoring of AI-generated code. This metric captures the "hidden cost" of AI-assisted development — code that appears to be delivered quickly but generates downstream correction work.

Measurement Method

  • Numerator: Lines of AI-assisted code modified within 30 days of initial merge, excluding planned refactoring and feature enhancements (see the sketch after this list)
  • Denominator: Total lines of AI-assisted code merged in the measurement period
  • Unit: Percentage
  • Aggregation: Monthly, with a rolling 3-month average for trend analysis
  • Segmentation: SHOULD be segmented by rework reason (bug fix, security patch, performance, refactoring) and by originating team/developer
  • Comparison baseline: Compare AI-assisted code rework rate against human-only code rework rate
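
A minimal sketch of the monthly computation, assuming post-merge changes are already tagged with a rework reason (the Change record is illustrative):

```python
from dataclasses import dataclass

REWORK_WINDOW_DAYS = 30
# Only defect-driven rework counts against the target (see the note below);
# planned refactoring and feature enhancements are excluded.
DEFECT_REASONS = {"bug fix", "security patch", "performance"}

@dataclass
class Change:
    lines_modified: int
    days_since_merge: int  # days between the original merge and this revision
    reason: str            # e.g. "bug fix", "refactoring", "feature"

def rework_percentage(changes: list[Change], ai_loc_merged: int) -> float:
    """Percent of AI-assisted lines reworked for defects within 30 days."""
    reworked = sum(
        c.lines_modified for c in changes
        if c.days_since_merge <= REWORK_WINDOW_DAYS and c.reason in DEFECT_REASONS
    )
    return 100 * reworked / ai_loc_merged

def rolling_average(monthly_values: list[float], window: int = 3) -> float:
    """Rolling average over the most recent months, for trend analysis."""
    recent = monthly_values[-window:]
    return sum(recent) / len(recent)

changes = [Change(400, 12, "bug fix"), Change(300, 45, "bug fix"),
           Change(250, 5, "refactoring")]
# Only the first change counts: within 30 days and defect-driven.
print(rework_percentage(changes, 10_000))  # 4.0
```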

Targets by Maturity Level

| Maturity Level | AI-Assisted Rework Target | Comparison to Human Baseline | Notes |
| --- | --- | --- | --- |
| Level 2 | Measurement begins | N/A | Establishing tracking capability |
| Level 3 | <= 20% | <= 1.5x human baseline | Governance reduces but does not eliminate rework |
| Level 4 | <= 15% | <= 1.2x human baseline | Certified developers and automated scanning improve quality |
| Level 5 | <= 8% | <= 1.0x human baseline (parity) | AI-first workflows produce code at human quality levels |

Rework Is Not Always Bad

Not all rework indicates a quality problem. Rework due to changing requirements, planned refactoring, or evolving architectural patterns is normal. The metric SHOULD distinguish between defect-driven rework (indicating quality issues) and evolution-driven rework (indicating healthy iteration). Only defect-driven rework counts against the target.

Root Cause Analysis

When rework percentage exceeds targets, the following root cause categories SHOULD be investigated:

| Root Cause Category | Indicators | Remediation |
| --- | --- | --- |
| Inadequate prompt engineering | AI output misaligns with requirements | Enhanced prompt engineering training |
| Insufficient human review | Reviewers approve AI code without critical examination | Review process reinforcement, reviewer training |
| AI tool limitation | Systematic patterns of incorrect output for specific task types | Tool configuration adjustment or task exclusion |
| Missing context | AI lacks project-specific knowledge for accurate generation | RAG implementation, context management improvement |
| Unclear requirements | Requirements are ambiguous; AI and human interpretations diverge | Requirements process improvement (upstream fix) |

KPI-R4: Technical Debt Ratio

Definition

Technical Debt Ratio measures AI-attributed technical debt as a proportion of the total technical debt backlog. Technical debt includes code quality issues, architectural inconsistencies, missing tests, incomplete documentation, and deferred refactoring that are specifically attributable to AI-generated code patterns.

Measurement Method

  • Numerator: Technical debt items in the backlog attributed to AI-generated code (see the sketch after this list)
  • Denominator: Total technical debt items in the backlog
  • Unit: Percentage
  • Aggregation: Monthly snapshot
  • Attribution method: Technical debt items are tagged with an AI-attribution flag during creation or triage. Automated static analysis tools (SonarQube, CodeClimate) MAY auto-tag debt items identified in AI-assisted code segments.
  • Debt categories: SHOULD be classified by type (code quality, architecture, testing, documentation, security)
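
A minimal sketch of the monthly snapshot, assuming backlog items carry the AI-attribution flag and a debt category (the item structure is illustrative):

```python
from collections import Counter

def debt_ratio(backlog: list[dict]) -> float:
    """AI-attributed technical debt as a percentage of the total backlog."""
    ai_items = sum(1 for item in backlog if item.get("ai_attributed"))
    return 100 * ai_items / len(backlog)

def ai_debt_by_category(backlog: list[dict]) -> Counter:
    """Breakdown of AI-attributed debt by type for the monthly snapshot."""
    return Counter(item["category"] for item in backlog
                   if item.get("ai_attributed"))

backlog = [
    {"id": "TD-1", "category": "testing", "ai_attributed": True},
    {"id": "TD-2", "category": "architecture", "ai_attributed": False},
    {"id": "TD-3", "category": "code quality", "ai_attributed": True},
    {"id": "TD-4", "category": "documentation", "ai_attributed": False},
]
print(debt_ratio(backlog))           # 50.0, far above the Level 3 target of 15%
print(ai_debt_by_category(backlog))  # Counter({'testing': 1, 'code quality': 1})
```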

Targets by Maturity Level

| Maturity Level | Target | Notes |
| --- | --- | --- |
| Level 2 | Measurement begins | Establishing debt tracking capability |
| Level 3 | <= 15% of total backlog | AI debt is present but contained |
| Level 4 | <= 10% of total backlog | Improved code quality reduces debt generation |
| Level 5 | <= 5% of total backlog | AI-first workflows produce architecturally coherent code |

AI-Specific Technical Debt Patterns

Organizations SHOULD monitor for these AI-specific technical debt patterns:

| Debt Pattern | Description | Detection Method |
| --- | --- | --- |
| Inconsistent abstraction | AI generates code at different abstraction levels than surrounding codebase | Architecture review, static analysis |
| Duplicate logic | AI regenerates existing functionality rather than reusing established patterns | Duplicate code detection tools |
| Over-engineering | AI generates unnecessarily complex solutions for simple problems | Code complexity metrics (cyclomatic complexity) |
| Missing error handling | AI-generated happy-path code lacks robust error handling | Coverage analysis, fault injection testing |
| Stale patterns | AI suggests patterns from training data that are deprecated in the current codebase | Linting rules, pattern matching |
| Test gaps | AI generates implementation code without corresponding tests | Coverage analysis per commit |

Risk Dashboard Design

Organizations at Level 4 and above MUST maintain an integrated risk dashboard. The RECOMMENDED dashboard layout includes:

Executive View

  • Severity-weighted incident score (current quarter vs. previous quarter)
  • Security findings ratio trend (6-month view)
  • Overall risk status (green/yellow/red based on threshold breaches)
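
The traffic-light status can be rolled up from the four risk KPIs. A sketch, where the "within 20% of threshold is yellow" convention is an illustrative assumption, not a prescribed rule:

```python
def overall_risk_status(kpis: dict[str, tuple[float, float]]) -> str:
    """Roll KPI readings up to a green/yellow/red status.

    kpis maps a KPI name to (current value, threshold). Any breach is red;
    any KPI within 20% of its threshold is yellow; otherwise green.
    """
    if any(value > threshold for value, threshold in kpis.values()):
        return "red"
    if any(value > 0.8 * threshold for value, threshold in kpis.values()):
        return "yellow"
    return "green"

print(overall_risk_status({
    "severity_weighted_incident_score": (6.0, 8.0),
    "security_findings_ratio": (1.3, 1.5),  # within 20% of its threshold
    "rework_percentage": (11.0, 15.0),
    "technical_debt_ratio": (7.0, 10.0),
}))  # -> "yellow"
```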

Governance Board View

  • All four risk KPIs with trend lines
  • Breakdown by team and product
  • Top 5 risk items requiring action
  • Correlation analysis between risk metrics and productivity metrics

Team View

  • Team-specific risk KPIs
  • Individual code quality trends (anonymized where required)
  • Rework items requiring attention
  • Security findings assigned to team

Escalation and Response

When risk metrics breach defined thresholds, the following escalation MUST be followed:

| Condition | Action | Responsible |
| --- | --- | --- |
| Any single SEV1 AI-related incident | Immediate executive notification; RCA within 48 hours | CISO + VP Engineering |
| Security findings ratio exceeds 3.0x baseline | Emergency governance board review; consider restricting AI usage for affected code area | AI Governance Board |
| Rework percentage exceeds 25% for two consecutive months | Root cause investigation; targeted training or tool adjustment | Engineering Management |
| Technical debt ratio exceeds 20% | Dedicated debt reduction sprint; review AI coding guidelines | Team Leads |
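
These triggers lend themselves to automated checks against the dashboard data. A minimal sketch (the snapshot structure and metric names are illustrative):

```python
def escalations(snapshot: dict) -> list[str]:
    """Evaluate the escalation conditions from the table above."""
    actions = []
    if snapshot["sev1_ai_incidents"] >= 1:
        actions.append("Immediate executive notification; RCA within 48 hours")
    if snapshot["findings_ratio"] > 3.0:
        actions.append("Emergency governance board review")
    # Rework above 25% for two consecutive months.
    if all(month > 25.0 for month in snapshot["rework_pct_last_two_months"]):
        actions.append("Root cause investigation; training or tool adjustment")
    if snapshot["debt_ratio"] > 20.0:
        actions.append("Debt reduction sprint; review AI coding guidelines")
    return actions

print(escalations({
    "sev1_ai_incidents": 0,
    "findings_ratio": 3.4,                       # breaches the 3.0x threshold
    "rework_pct_last_two_months": [26.0, 27.5],  # two consecutive breaches
    "debt_ratio": 12.0,
}))  # -> the findings-ratio and rework escalations fire
```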

Cross-References