The Reference Operating Model
Four interlocking layers govern AI from ambition to live operation. Strategic intent sets the conditions, tactical capabilities turn intent into reusable assets, operational delivery runs services safely, and continuous assurance evidences trust. Every cell maps to one or more of the source frameworks.
Intent & Ambition
Capability & Platforms
Delivery & Service
Assurance & Evidence
How to use this model
- Classify the AI use case using Risk Tiering. R-tier sets the proportionate control depth.
- Sequence the work through the four lifecycle phases — do not collapse pilot results into live readiness.
- Apply the E2VT loop at each phase gate to generate evidence rather than assertion.
- Govern using the RACI — every artefact has a named accountable owner.
- Stress-test using the failure-mode matrix before formal service assessment.
- Evidence using the templates: service story, evidence tracker, model card, phase decision log, incident report.
10 Foundational Principles
Drawn from the Strategic AI Governance position and aligned with the AI Risk Management Toolkit, these are the non-negotiable principles every AI-enabled public service must satisfy regardless of risk tier.
Risk Tiering and Categories
Risk tiering is the cornerstone of proportionate assurance. It converts the binary question “is this AI risky?” into a four-band decision that drives the depth of evaluation, the seniority of sign-off, and the cadence of re-certification.
Risk Tiers (R1–R4)
R1 Low
- Basic offline evaluation; usage logs
- Lightweight privacy & security checks
- A/B testing only after sign-off
R2 Medium
- Expanded offline eval; calibration; bias screen
- Red-team lite
- Monitoring SLOs and drift alerts
R3 High
- Formal validation protocols (E2VT); SME panels
- Strong privacy & PII controls
- Comprehensive red-teaming; HITL; change-freeze windows
R4 Critical
- Independent assurance & external audit
- Formal validation campaigns (E2VT); rigorous canary with kill-switch
- Full traceability; quarterly re-certification
The Nine Risk Categories
Aligned with the AI Risk Management Toolkit (D02 v1). Every assessed AI use case is screened against these nine categories to identify the dominant risk profile and the assurance disciplines required.
Tier × Category cross-walk
Use this view to choose proportionate controls for each of the nine categories at each tier.
| Category | R1 Low | R2 Medium | R3 High | R4 Critical |
|---|---|---|---|---|
| Financial | Cost log | VfM check | Recurring VfM | Independent VfM audit |
| Legal & regulatory compliance | Legal screen | DPIA | DPIA + EqIA | DPIA + EqIA + external counsel |
| Appropriate transparency & explainability | Internal note | User notice | ATRS entry | ATRS + published model card |
| Fairness | Bias awareness | Bias screen | Sub-group testing | External fairness audit |
| Accountability & governance | Product owner | SRO named | Assurance board | Board + minister sighting |
| Contestability & redress | Email route | Help channel | Defined appeal | Statutory appeal & ombudsman |
| Technical robustness | Offline eval | Calibration + drift | Formal validation + canary | Independent test campaign |
| Security | Baseline cyber hygiene | Threat model + secrets mgmt | Red-team + prompt-injection tests | Independent pen-test + continuous monitoring |
| People & the environment | Low | Wellbeing check | Vulnerability & sustainability screen | Safeguarding partnership + env. impact assessment |
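As an illustration of how the cross-walk can be read programmatically, the sketch below encodes three of the nine rows and returns the control for each category at a chosen tier. The dictionary layout and the name `controls_for` are assumptions made for this example; they are not part of the operating model.

```python
# Illustrative encoding of three rows of the Tier x Category cross-walk.
# The data layout and function name are hypothetical, chosen for this sketch.
TIERS = ("R1", "R2", "R3", "R4")

CROSSWALK = {
    "Fairness": (
        "Bias awareness",
        "Bias screen",
        "Sub-group testing",
        "External fairness audit",
    ),
    "Technical robustness": (
        "Offline eval",
        "Calibration + drift",
        "Formal validation + canary",
        "Independent test campaign",
    ),
    "Security": (
        "Baseline cyber hygiene",
        "Threat model + secrets mgmt",
        "Red-team + prompt-injection tests",
        "Independent pen-test + continuous monitoring",
    ),
}


def controls_for(tier: str) -> dict[str, str]:
    """Return the cross-walk control for each encoded category at the given tier."""
    if tier not in TIERS:
        raise ValueError(f"unknown tier: {tier!r}")
    column = TIERS.index(tier)
    return {category: cells[column] for category, cells in CROSSWALK.items()}


if __name__ == "__main__":
    for category, control in controls_for("R3").items():
        print(f"{category}: {control}")
```

Extending the sketch to all nine categories only requires adding the remaining rows to `CROSSWALK`.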
Service Lifecycle & Phase Gates
A pilot is not a beta. A beta is not live. Each phase asks a different assurance question and demands different evidence. This view encodes the Level 3 lifecycle, the AI Assurance Questionnaire stages and the Service Standard expectations into one progression.
Phase 1 Pilot — is this worth pursuing?
Primary assurance question
Is the AI use case valid? Can we identify the risk?
AI-specific focus
User need; feasibility; risk discovery; non-AI alternatives appraised.
Expected evidence
- Use-case rationale (problem → AI → outcome)
- Prototype results; failure-category log
- Initial risk register & data assessment
- Comparison with rules-based / manual alternative
Assessor-style questions
- What user need does the AI capability serve?
- What non-AI options did you consider?
- What risky assumptions did the pilot test?
- What would cause you to stop or redesign the AI use case?
- Who owns the AI risk at this phase?
Decision
Proceed to managed beta preparation · Pivot · Stop
Phase 2 Managed Beta — safe under controlled conditions?
Primary assurance question
Can the service work safely with limited users under controlled conditions?
AI-specific focus
Human control; monitoring; security & privacy; performance; accessibility; fallback.
Expected evidence
- Evaluation results against representative scenarios
- DPIA, threat model, model card
- Test logs; support model; rollback plan
- Accessibility & fairness testing pre-launch
Assessor-style questions
- How do users know when AI is involved?
- Where does meaningful human control happen?
- What happens when the AI output is wrong?
- How are prompt injection, adversarial inputs or misuse handled?
- What is the rollback plan?
Decision
Proceed to private/public beta · Remediate · Pause
Phase 3 Beta — safe with real users?
Primary assurance question
Can the service work safely with real users under controlled conditions?
AI-specific focus
Live drift monitoring; supplier risk; sub-group performance; incident response readiness.
Expected evidence
- Pre-production performance dashboard
- External-AI integration test results & versioning controls
- Rollback rehearsal log; incident runbook
- Differential-impact monitoring plan
Decision
Proceed to live readiness · Remediate · Pause
Phase 4 Live — safely operating at scale?
Primary assurance question
Can the service operate safely, reliably and continuously at scale — including when the AI degrades, fails or changes?
AI-specific focus
Drift monitoring; incident response; model updates; operational ownership; continuous assurance.
Expected evidence
- Runbooks & on-call escalation routes
- Live dashboards: accuracy, fairness, hallucination, cost
- Audit logs; retraining / update controls
- Supplier-failure fallback; sustainability evidence
- Quarterly re-certification record (R3/R4)
Assessor-style questions
- Who can pause, disable or roll back the AI component?
- How are model updates approved?
- What happens if the AI supplier fails?
- Is AI still the most cost-effective way to meet the user need?
- What evidence will be maintained for reassessment or amber review?
Decision
Go live · Limited live · Delay · Reassess
E2VT & Service Standard alignment
Trust cannot be asserted. It has to be evidenced. The E2VT loop — Evaluate, Evidence, Validate, Trust — is the operational discipline applied at every phase gate. It maps onto the 14 Service Standard points so that AI-specific evidence travels through the same assessment route as any other digital service.
Evaluate
Test models rigorously against defined criteria. Are outcomes good enough? Is real-world impact assessed?
Evidence
Ground every assurance claim in evidence, not assumption. Controls and requirements demonstrably met.
Validate
The system meets user, policy and regulatory needs. Are we solving the right problem in the right way?
Trust
End-to-end workflows defined; monitoring & incident response live; transparency artefacts published.
14 Service Standard points — AI-enabled evidence
| Standard point | AI-enabled service evidence | Typical owner |
|---|---|---|
| 1. Understand users & needs | Research showing AI solves a real user need (not a tech preference) | User research lead |
| 2. Solve a whole problem | End-to-end journey showing where the AI boundary starts and stops | Product / design lead |
| 3. Joined-up channels | Assisted-digital, offline and non-AI fallback journeys defined | Service designer |
| 4. Simple to use | Explanations, confidence language, user-facing control over AI outputs | Interaction / content lead |
| 5. Everyone can use it | Accessibility, inclusion and bias testing covering non-standard speech, dialects, disability, low digital confidence | Design / research lead |
| 6. Multidisciplinary team | Named AI, data, security, policy and service owners | Service owner |
| 7. Agile ways of working | Iteration log, AI learning loop, decision records | Delivery manager |
| 8. Iterate and improve | Evaluation cycles, controlled model-update process | Product / AI lead |
| 9. Secure and private | DPIA, threat model, data minimisation, prompt-injection testing | Tech / security lead |
| 10. Define success | Service KPIs plus AI accuracy, fairness and safety metrics | Performance analyst |
| 11. Right tools and tech | AI option appraisal, model-selection rationale, exit plan | Technical architect |
| 12. Open source | Code, prompts, configs, reusable patterns where appropriate | Tech lead |
| 13. Open standards / components | Reuse of common platforms, open standards, shared patterns | Tech / design lead |
| 14. Reliable service | Runbooks, monitoring, fallback, rollback, incident process | Service / ops lead |
Governance Structure & RACI
Governance must be operational, not ceremonial. Each artefact has a single accountable person, a working responsible team, and clearly defined consulted and informed parties. The model layers a strategic AI Governance Board over departmental assurance boards and product-level day-to-day controls.
Three-tier governance
Operating RACI
| Area | Accountable (A) | Responsible (R) | Consulted (C) | Informed (I) |
|---|---|---|---|---|
| Risk Tiering | Service / Product Owner | Principal Technologist, Safety Lead | Legal, DPO | Team |
| Evaluation Plan | Principal Technologist | ML Eng, Data Scientist, QA | Domain SMEs, UX Researchers | AI Governance Board |
| Data Provenance | AI / Data Lead | ML Eng, Data Scientist | Domain SMEs, UX Researchers | AI Governance Board |
| Red Team | Safety Lead | Security, Red-Teamers | Product Owner, Legal | All |
| DPIA / EqIA | DPO | Product Owner, Legal | SIRO, User Research | Assurance Board |
| Go-Live decision | SRO | Service / Product Owner | Assurance Board | Stakeholders |
| Monitoring & Incidents | Service / Product Owner | SRE / On-call | Safety Lead, Comms | All |
| Re-certification | SRO | Principal Technologist | Assurance Board | AI Governance Board |
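The rule that each artefact has a single accountable person and a working responsible team can be checked mechanically. The following is a minimal sketch under that reading; `RaciRow` and `raci_problems` are hypothetical names, and the last sample row is invented purely to show a failing case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RaciRow:
    """One row of a RACI table (hypothetical structure for this sketch)."""
    area: str
    accountable: tuple[str, ...]
    responsible: tuple[str, ...]


def raci_problems(rows: list[RaciRow]) -> list[str]:
    """List rows that break the 'single accountable, named responsible' rule."""
    problems = []
    for row in rows:
        if len(row.accountable) != 1:
            problems.append(
                f"{row.area}: expected exactly one accountable role, "
                f"found {len(row.accountable)}"
            )
        if not row.responsible:
            problems.append(f"{row.area}: no responsible team named")
    return problems


if __name__ == "__main__":
    rows = [
        RaciRow("Go-Live decision", ("SRO",), ("Service / Product Owner",)),
        RaciRow("Red Team", ("Safety Lead",), ("Security", "Red-Teamers")),
        # Invented failing row: two accountable roles and nobody responsible.
        RaciRow("Example artefact", ("SRO", "DPO"), ()),
    ]
    for problem in raci_problems(rows):
        print(problem)
```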
Towards observable maturity
The target operating model is supported by four operational dashboards/registries — each is a capability that should exist at department or cross-government level:
AI-enabled Failure-Mode Matrix
The bigger assessment risk is rarely whether the model works. It is whether the team can show why AI is appropriate, how harms are detected, how humans remain in meaningful control, and how the service operates reliably in live conditions. Stress-test these failure modes before assessment.
| Failure mode | Pilot intervention | Managed beta intervention | Beta / live intervention | Service Standard points at risk |
|---|---|---|---|---|
| AI not justified by user need | Compare AI vs non-AI options | Validate with controlled users | Reassess VfM & user impact | 1, 2, 11 |
| Model output wrong or misleading | Identify failure categories | Test against representative scenarios | Monitor accuracy & incidents | 9, 10, 14 |
| Bias or exclusion | Identify protected-characteristic risks | Accessibility & fairness testing | Monitor differential impact | 5, 9, 10 |
| Prompt injection / adversarial misuse | Threat-model attack paths | Test adversarial prompts & abuse | Monitor abuse, patch controls | 9, 14 |
| Weak human control | Define human role & handoffs | Test decision handoff & override | Audit human review & escalation | 4, 6, 9 |
| Model drift | Define baseline | Monitor pre-live quality changes | Drift alerts & retraining governance | 8, 10, 14 |
| Supplier / model unavailable | Identify dependency | Test fallback | Operate fallback & incident route | 11, 14 |
| Governance unclear | Name accountable owner | Confirm governance gates | Maintain live decision log | 6, 8, 14 |
| Evidence scattered | Build evidence map | Rehearse assessor narrative | Maintain evidence pack | All 14 |
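When preparing for a specific Service Standard point, it can help to invert the matrix and ask which failure modes threaten that point. The sketch below does this with the point numbers from the last column of the table; the names `FAILURE_MODE_POINTS` and `modes_threatening` are assumptions for illustration.

```python
# Failure mode -> Service Standard points at risk, copied from the matrix.
# The variable and function names are hypothetical.
FAILURE_MODE_POINTS = {
    "AI not justified by user need": {1, 2, 11},
    "Model output wrong or misleading": {9, 10, 14},
    "Bias or exclusion": {5, 9, 10},
    "Prompt injection / adversarial misuse": {9, 14},
    "Weak human control": {4, 6, 9},
    "Model drift": {8, 10, 14},
    "Supplier / model unavailable": {11, 14},
    "Governance unclear": {6, 8, 14},
    "Evidence scattered": set(range(1, 15)),  # "All 14"
}


def modes_threatening(point: int) -> list[str]:
    """Return, alphabetically, the failure modes that put a given point at risk."""
    if not 1 <= point <= 14:
        raise ValueError("Service Standard points run from 1 to 14")
    return sorted(
        mode for mode, points in FAILURE_MODE_POINTS.items() if point in points
    )


if __name__ == "__main__":
    for mode in modes_threatening(11):
        print(mode)
```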
Answer-quality scoring (for reference)
| Score | Meaning | Example |
|---|---|---|
| 1 | Assertion only | “We have tested this.” |
| 2 | Some evidence | “We ran testing and have results.” |
| 3 | Evidence linked to risk | “Testing showed these risks and these controls are in place.” |
| 4 | Evidence + control + ownership + learning | “Testing showed X; control Y is owned by Z; we revise on amber review.” |
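One way to apply the scale consistently is to record which elements an answer contains and derive the score from them. The sketch below is one possible reading of the table, not an official rubric: it assumes score 3 needs evidence linked to a risk, and score 4 additionally needs a control, a named owner and a learning loop. The function name and parameters are hypothetical.

```python
def answer_score(
    has_evidence: bool,
    linked_to_risk: bool = False,
    has_control: bool = False,
    has_owner: bool = False,
    has_learning_loop: bool = False,
) -> int:
    """Map the elements present in an answer to the 1-4 answer-quality scale."""
    if not has_evidence:
        return 1  # assertion only
    if not linked_to_risk:
        return 2  # some evidence, but not tied to a risk
    if not (has_control and has_owner and has_learning_loop):
        return 3  # evidence linked to risk
    return 4  # evidence + control + ownership + learning


if __name__ == "__main__":
    print(answer_score(has_evidence=False))
    print(answer_score(True, linked_to_risk=True, has_control=True))
    print(answer_score(True, True, True, True, True))
```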
Operating Templates
Eight reusable templates that turn the operating model into daily artefacts. Together they form the minimum evidence pack any AI-enabled service should hold.
| Standard area | Claim | Evidence location | Owner | RAG |
|---|---|---|---|---|

| Activity | A | R | C | I |
|---|---|---|---|---|
Centrally Endorsed Guidance for Responsible Deployment
Departmental approaches to AI assurance are inconsistent today — some teams over-engineer controls, others under-control. This section sets a single, written and centrally endorsed position covering responsible deployment, model-update governance, accuracy expectations and acceptable risk thresholds. It is designed to unblock safe experimentation by making the rules explicit, so teams stop guessing and risk-aversion stops acting as a default veto.
G1 · Responsible deployment standard
Every AI-enabled service deployed in the UK public sector must satisfy the following seven mandatory commitments, regardless of risk tier or department. Anything below this floor is not a deployment; it is an experiment and must remain inside a controlled environment.
G2 · Model update governance
Model behaviour changes either intentionally (retrain, prompt edit, version bump) or externally (supplier model swap, fine-tuning). All five update types in the table below trigger the same governance path, and the depth of that path is set by R-tier.
| Update type | Trigger | R1 | R2 | R3 | R4 |
|---|---|---|---|---|---|
| Prompt / configuration change | Team-initiated | Peer review & log | Peer review + regression suite | Change Advisory Board + canary | CAB + change-freeze respected + canary + rollback rehearsal |
| Retrain / fine-tune | Drift, new data or scheduled | Baseline diff | Eval pack rerun | Full E2VT rerun + SME review | Independent re-validation + external audit hook |
| Version bump (own model) | Release process | Semver + changelog | Eval rerun + SLO check | Canary 5 → 25 → 100% with kill-switch | Quarterly re-certification supersedes |
| Supplier model change | Provider-initiated | Notify owner | Regression + bias delta | Pause flow + revalidate + ATRS update | Halt service until revalidated & signed off by SRO |
| Knowledge base / RAG update | Content team | Quality spot-check | Hallucination diff vs baseline | SME panel + factuality eval | SME + independent eval, with citation audit |
Hard rule for all tiers: no silent updates. Every model change leaves a record in the Model Registry, the decision log and (R2+) the ATRS entry. Where the change is provider-initiated, the service operates under a presumption to pause until revalidation completes.
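The table and the hard rule can be combined into a single lookup that tells a team which governance path applies, which records must be written, and whether the presumption to pause is in force. The sketch below encodes only two of the five rows for brevity; the key names and `plan_update` are hypothetical.

```python
# Two rows of the G2 table; the remaining rows would follow the same shape.
# Keys and function name are assumptions made for this sketch.
UPDATE_PATHS = {
    "prompt_or_config_change": {
        "R1": "Peer review & log",
        "R2": "Peer review + regression suite",
        "R3": "Change Advisory Board + canary",
        "R4": "CAB + change-freeze respected + canary + rollback rehearsal",
    },
    "supplier_model_change": {
        "R1": "Notify owner",
        "R2": "Regression + bias delta",
        "R3": "Pause flow + revalidate + ATRS update",
        "R4": "Halt service until revalidated & signed off by SRO",
    },
}


def plan_update(update_type: str, tier: str) -> dict:
    """Return the governance path and the records required for a model change."""
    path = UPDATE_PATHS[update_type][tier]
    # No silent updates: every change is recorded; ATRS applies from R2 upwards.
    records = ["Model Registry", "decision log"]
    if tier != "R1":
        records.append("ATRS entry")
    return {
        "path": path,
        "records": records,
        # Provider-initiated changes carry a presumption to pause.
        "presume_pause": update_type == "supplier_model_change",
    }


if __name__ == "__main__":
    print(plan_update("supplier_model_change", "R3"))
```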
G3 · Accuracy expectations
Accuracy is not a single number — and "high accuracy" is not a control. The central position is that each service must declare four metrics with explicit thresholds, set before launch and proportionate to tier. Below the lower threshold the service must pause; in the amber band it must remediate; above the green threshold it may operate.
| Metric family | What it measures | R1 floor | R2 floor | R3 floor | R4 floor |
|---|---|---|---|---|---|
| Task accuracy | Correct outputs on golden / held-out set | ≥ 70% | ≥ 85% | ≥ 92% | ≥ 95% with CI reported |
| Sub-group parity | Max gap in accuracy across protected groups | ≤ 15 pp | ≤ 10 pp | ≤ 5 pp | ≤ 3 pp with rationale |
| Hallucination / factuality | Unsupported claims on representative prompts | ≤ 10% | ≤ 5% | ≤ 2% | ≤ 1% with citation audit |
| Refusal & safety | Correct refusal of unsafe / out-of-scope queries | ≥ 80% | ≥ 90% | ≥ 95% | ≥ 98% with red-team set |
Thresholds are reference defaults. Departments may set stricter values for the same tier; they may not relax them without AI Governance Board approval. pp = percentage points.
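A minimal sketch of how the reference floors could be checked automatically is shown below. It covers only the numeric limits from the table: the additional R4 conditions (confidence interval, rationale, citation audit, red-team set) and the amber band are not modelled. The metric keys and the name `floor_breaches` are assumptions; values are percentages, except the parity gap, which is in percentage points.

```python
import operator

# Metric -> (comparison that must hold, limit per tier), taken from the table.
# Metric keys are hypothetical names chosen for this sketch.
FLOORS = {
    "task_accuracy": (operator.ge, {"R1": 70.0, "R2": 85.0, "R3": 92.0, "R4": 95.0}),
    "subgroup_parity_gap_pp": (operator.le, {"R1": 15.0, "R2": 10.0, "R3": 5.0, "R4": 3.0}),
    "hallucination_rate": (operator.le, {"R1": 10.0, "R2": 5.0, "R3": 2.0, "R4": 1.0}),
    "refusal_rate": (operator.ge, {"R1": 80.0, "R2": 90.0, "R3": 95.0, "R4": 98.0}),
}


def floor_breaches(tier: str, measured: dict[str, float]) -> list[str]:
    """Return the metrics that are missing or fail the reference floor for a tier."""
    failed = []
    for metric, (holds, limits) in FLOORS.items():
        value = measured.get(metric)
        # An undeclared metric is treated as a breach: all four must be reported.
        if value is None or not holds(value, limits[tier]):
            failed.append(metric)
    return failed


if __name__ == "__main__":
    results = {
        "task_accuracy": 88.0,
        "subgroup_parity_gap_pp": 12.0,
        "hallucination_rate": 4.0,
        "refusal_rate": 91.0,
    }
    print(floor_breaches("R2", results))
```

A department applying stricter local values would replace the limits in `FLOORS`; per the note above, loosening them needs AI Governance Board approval.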
G4 · Acceptable risk thresholds
Mapped to the Orange Book-aligned appetite scale (Averse → Eager) used in the AI Risk Management Toolkit, against the nine risk categories.
| Risk category | Central position | Floor (will not accept below) | Trigger for pause |
|---|---|---|---|
| Financial | Open | Project not solvent without AI subsidy | Run-rate > 1.5× business case |
| Legal & regulatory compliance | Minimal | Any unresolved legal challenge with material likelihood | DPIA / EqIA finding above ‘low’ not remediated |
| Appropriate transparency & explainability | Cautious | Users cannot tell they are using AI | ATRS entry missing / out of date at R2+ |
| Fairness | Minimal | Sub-group parity gap > G3 floor | Live monitoring shows widening gap over 2 cycles |
| Accountability & governance | Cautious | No named SRO; no decision log | SRO change without handover within 10 working days |
| Contestability & redress | Cautious | No redress channel defined | Redress SLA breach rate > 5% |
| Technical robustness | Open | Below G3 task-accuracy floor | Drift alert sustained for > 1 cycle without action |
| Security | Averse | Unmitigated prompt-injection or data-exfil risk | Any sev-1 incident; CVE on the model path |
| People & the environment | Minimal | Foreseeable harm to vulnerable cohort without safeguard | Safeguarding incident, or sustainability budget breach |
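Several of the pause triggers are numeric and can be evaluated from monitoring data. The sketch below covers four of them. It assumes that "widening gap over 2 cycles" means the parity gap increased in each of the last two monitoring cycles, which is an interpretation rather than a definition given in the table; the function and parameter names are hypothetical.

```python
def pause_triggers(
    run_rate: float,
    business_case: float,
    redress_breach_rate: float,
    parity_gaps_pp: list[float],
    drift_cycles_unactioned: int,
) -> list[str]:
    """Return the risk categories whose numeric pause trigger has fired."""
    fired = []
    # Financial: run-rate above 1.5x the business case.
    if run_rate > 1.5 * business_case:
        fired.append("Financial")
    # Contestability & redress: SLA breach rate above 5% (given as a fraction).
    if redress_breach_rate > 0.05:
        fired.append("Contestability & redress")
    # Fairness: gap widened in each of the last two cycles (assumed reading).
    if len(parity_gaps_pp) >= 3:
        older, previous, latest = parity_gaps_pp[-3:]
        if older < previous < latest:
            fired.append("Fairness")
    # Technical robustness: drift alert sustained for more than one cycle.
    if drift_cycles_unactioned > 1:
        fired.append("Technical robustness")
    return fired


if __name__ == "__main__":
    print(
        pause_triggers(
            run_rate=160.0,
            business_case=100.0,
            redress_breach_rate=0.03,
            parity_gaps_pp=[2.0, 2.5, 3.1],
            drift_cycles_unactioned=1,
        )
    )
```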
G5 · Pre-approved deployment routes
To remove the most common cause of departmental risk-aversion — uncertainty about what is allowed — the following pre-approved patterns may be deployed at R1/R2 using the reference controls only, without bespoke board review:
- Pattern A — Internal staff productivity assistant (drafting, summarisation, search) on non-personal corporate data. Requires D1, D5, D7.
- Pattern B — Internal knowledge retrieval (RAG) over departmental knowledge base, with citation. Requires D1, D2, D5, D7.
- Pattern C — User-facing informational assistant with explicit "AI-generated" labelling and human-reviewed corpus. Requires D1–D7.
- Pattern D — Caseworker triage / prioritisation (advisory, not decisional). Requires D1–D7 and sub-group parity evidence (G3).
Anything outside the pre-approved patterns, or at R3/R4, requires departmental Assurance Board sign-off plus AI Governance Board notification.
G6 · Consistency across government
To prevent departmental drift, the following are common across all departments and may not be re-defined locally:
G7 · How this navigates risk aversion
The most common failure pattern observed in departmental practice is not deploying badly — it is not deploying at all, because no one is sure what "good enough" looks like. This guidance addresses each pattern:
| Symptom of risk aversion | Central guidance response |
|---|---|
| “We don’t know what controls are required.” | D1–D7 deployment commitments + Tier×Category cross-walk are written and endorsed. |
| “What accuracy is good enough?” | G3 declares default floors per tier; departments may set stricter, not looser. |
| “How do we change the model safely?” | G2 sets the five update triggers and tiered response. |
| “What if the supplier changes the model?” | G2 supplier-change row + ‘presumption to pause’. |
| “Do we need board sign-off for every use case?” | No — G5 pre-approved patterns are deployable without bespoke review at R1/R2. |
| “Our department does it differently to others.” | G6 fixes the non-negotiable common ground; local addition allowed, redefinition not. |
| “How do we know when to stop?” | G4 pause triggers + G3 amber/red thresholds are explicit, not judgemental. |
G8 · Where this guidance lives in the operating model
- Strategic layer — G1, G4 and G6 sit with the AI Governance Board (L1).
- Tactical layer — G2, G3 and G5 are owned by the departmental Assurance Board (L2) and implemented through shared evaluation standards and the Model Registry.
- Operational layer — D1–D7 are enforced at phase gates by the Service / Product Owner (L3) using the templates in the Templates tab.
- Assurance layer — G4 pause triggers feed directly into the Failure-Mode Matrix and the AAQ v4.3 evidence questions.
Framework Synergy & Critical Alignment
The five source frameworks each address a slice of the AI assurance problem. The reference operating model integrates them so a single piece of evidence satisfies multiple frameworks simultaneously — reducing duplication and exposing genuine gaps.
Cross-framework synergy matrix
Where the same operating-model area is informed by multiple frameworks, alignment is genuine. Gaps below are where the reference operating model adds bridging guidance.
| OPM area | Strategic Gov | Level 3 | AAQ v4.3 | Risk Toolkit | Research |
|---|---|---|---|---|---|
| Risk appetite / tiering | ● primary | ○ | ○ | ● | – |
| Service lifecycle gates | ○ | ● primary | ● | ○ | ○ |
| Evidence & questioning | ○ | ● | ● primary | ○ | ○ |
| Failure modes & controls | ○ | ● | ○ | ● primary | ○ |
| User research integrity | – | ○ | ○ | – | ● primary |
| Operating dashboards | ● primary | – | – | ○ | – |
| RACI & accountability | ● | ○ | ○ | ● primary | – |
● primary contributor · ○ reinforcing contributor · – not in scope
Critical observations
- Synergy: Risk tiering (Strategic Gov) and failure-mode controls (Risk Toolkit) align cleanly — tier sets depth, toolkit sets shape. Use Templates T2 and T4 together.
- Synergy: Level 3 lifecycle phasing and AAQ v4.3 lifecycle stages are isomorphic — AAQ questions plug directly into the phase-gate accordions on the Lifecycle tab.
- Gap bridged: Research Assessment scrutinises hallucinated research outputs but is silent on live operation. The OPM routes its findings into Standard points 1 and 5, then into live drift monitoring.
- Gap bridged: Strategic Gov describes target dashboards (registry, hallucination, incident) but not who runs them. The Governance RACI assigns L2 Assurance Board ownership.
- Risk to mitigate: All five frameworks treat “evidence” differently. The Evidence Tracker (T3) consolidates to a single pack per service to prevent scatter — the most cited Level 3 failure.