UK Public Sector AI Governance Operating Model

The Reference Operating Model

Four interlocking layers govern AI from ambition to live operation. Strategic intent sets the conditions, tactical capabilities turn intent into reusable assets, operational delivery runs services safely, and continuous assurance evidences trust. Every cell maps to one or more of the source frameworks.

Strategic
Intent & Ambition

10 Foundational Principles: Risk appetite, public trust, ethics

Risk Appetite & Tiering: R1 Minimal → R4 Eager

Central Deployment Guidance: D1–D7 commitments; cross-government floor

Leadership Choice: Govern strategically

Public Accountability: Transparent records, ATRS, SIRO

Workforce & Capability: AI literacy, multi-disciplinary teams

Tactical
Capability & Platforms

Pre-approved Routes: Common AI use-case patterns

Shared Evaluation Standards: E2VT & test harnesses

Model Registry: Versioning, provenance, exit plan

Model Update Governance: G2 · five triggers, tiered response

Reusable Patterns: Open source, open standards

Procurement & Supplier Mgmt: Vendor lock-in, model-change risk

Operational
Delivery & Service

Pilot → MB → Beta → Live: Phase gates with AI assurance Qs

Service Standard (14): User need, inclusion, reliability

Human-in-the-loop: Override, appeal, fallback

Monitoring & Drift: Hallucination, accuracy, fairness

Incident Mgmt & Kill-switch: Rollback, escalation, comms

Continuous
Assurance & Evidence

E2VT Loop: Evaluate · Evidence · Validate · Trust

DPIA / EqIA / AIA: Privacy, equality, AI impact

Red-team & Adversarial: Prompt injection, abuse cases

Audit Trail & Decision Log: Reproducible reassessment

RACI & Assurance Board: Named accountable owner

Pilot

Is this worth pursuing? Can we identify the risk?

Managed Beta

Safe with limited users under controlled conditions?

Beta

Safe with real users, real data, real constraints?

Live

Safe, reliable, continuously assured at scale?

“Strategic AI governance is not the brake on innovation. It is the infrastructure that allows innovation to travel safely, at speed, and with public trust.” — Strategic AI Governance at Scale, GCAIO / DSIT

How to use this model

Classify the AI use case using Risk Tiering. R-tier sets the proportionate control depth.
Sequence the work through the four lifecycle phases — do not collapse pilot results into live readiness.
Apply the E2VT loop at each phase gate to generate evidence rather than assertion.
Govern using the RACI — every artefact has a named accountable owner.
Stress-test using the failure-mode matrix before formal service assessment.
Evidence using the templates: service story, evidence tracker, model card, phase decision log, incident report.

10 Foundational Principles

Drawn from the Strategic AI Governance position and aligned with the AI Risk Management Toolkit, these are the non-negotiable principles every AI-enabled public service must satisfy regardless of risk tier.

User need first

AI is justified by a real user or operational need, not by technological novelty. Non-AI alternatives have been considered.

Proportionate to risk

Controls scale with the R-tier. Vulnerable cohorts and high-consequence decisions demand independent assurance.

Meaningful human control

A named person can pause, disable, override, appeal or roll back any AI component in production.

Transparency

Users know when AI is involved, in language they understand. The service is recorded on the Algorithmic Transparency Recording Standard.

Fair and inclusive

Differential impact is tested before launch and monitored after. Accessibility evidence covers the full user base.

Secure by design

Threat-modelled for prompt injection, data exfiltration and adversarial misuse. Patched continuously.

Privacy preserving

DPIA completed. Minimum personal data principle applied to training, prompts and logs.

Evidenced, not asserted

Every assurance claim is grounded in evidence held in a discoverable evidence pack — not in one person’s head.

Reliable in operation

Runbooks, monitoring, drift alerts, supplier-failure fallback and rollback criteria all exist and have been rehearsed.

Accountable & reassessable

Single SRO. Decision log maintained. Service is reassessable on amber review or material change.

Risk Tiering and Categories

Risk tiering is the cornerstone of proportionate assurance. It converts a binary “is this AI risky?” into a four-band decision that drives the depth of evaluation, the seniority of sign-off, and the cadence of re-certification.

Risk Tiers (R1–R4)

R1 Low

Low consequence

Non-personal content search; internal lookups; low-stakes summarisation.

Basic offline evaluation; usage logs
Lightweight privacy & security checks
A/B testing only after sign-off

R2 Medium

Staff productivity / non-critical user advice

Drafting assistants; meeting summarisation affecting decisions; non-binding guidance.

Expanded offline eval; calibration; bias screen
Red-team lite
Monitoring SLOs and drift alerts

R3 High

Eligibility, triage, sensitive advice

Benefits triage; case-prioritisation; advice to citizens with real consequence.

Formal validation protocols (E2VT); SME panels
Strong privacy & PII controls
Comprehensive red-teaming; HITL; change-freeze windows

R4 Critical

Safety, legal exposure, vulnerable cohorts

Decisions with legal or safety consequence; vulnerable cohorts; statutory functions.

Independent assurance & external audit
Formal validation campaigns (E2VT); rigorous canary with kill-switch
Full traceability; quarterly re-certification

The Nine Risk Categories

Aligned with the AI Risk Management Toolkit (D02 v1). Every assessed AI use case is screened against these nine categories to identify the dominant risk profile and the assurance disciplines required.

Financial

Financial losses from increased operational costs, maintaining AI solutions, or financial implications of automated decisions.

Legal & regulatory compliance

Failing to meet legal frameworks — data protection, equality law, EU AI Act, UK GDPR and evolving AI-specific laws.

Appropriate transparency & explainability

Whether users and those impacted understand how the AI works, its decision-making, and that they are interacting with AI.

Fairness

Unfair, biased or discriminatory outcomes; impact on individual rights; compliance with equality laws.

Accountability & governance

Lack of clear accountability; ineffective governance; absent risk management processes, roles and communication.

Contestability & redress

Ability of users or affected parties to contest outputs or seek redress — accessible, transparent mechanisms.

Technical robustness

Reliable functioning and sustained performance — data quality, model reliability, behaviour under unexpected conditions.

Security

Threats arising from deployment — data poisoning, leakage, cyber-attacks, prompt injection and adversarial misuse.

People & the environment

Impact on physical and mental wellbeing, safety of critical infrastructure, and the environment.

Tier × Category cross-walk

Use this view to choose proportionate controls for each of the nine categories at each tier.

Category	R1 Low	R2 Medium	R3 High	R4 Critical
Financial	Cost log	VfM check	Recurring VfM	Independent VfM audit
Legal & regulatory compliance	Legal screen	DPIA	DPIA + EqIA	DPIA + EqIA + external counsel
Appropriate transparency & explainability	Internal note	User notice	ATRS entry	ATRS + published model card
Fairness	Bias awareness	Bias screen	Sub-group testing	External fairness audit
Accountability & governance	Product owner	SRO named	Assurance board	Board + minister sighting
Contestability & redress	Email route	Help channel	Defined appeal	Statutory appeal & ombudsman
Technical robustness	Offline eval	Calibration + drift	Formal validation + canary	Independent test campaign
Security	Baseline cyber hygiene	Threat model + secrets mgmt	Red-team + prompt-injection tests	Independent pen-test + continuous monitoring
People & the environment	Low	Wellbeing check	Vulnerability & sustainability screen	Safeguarding partnership + env. impact assessment
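
The cross-walk can be read as a simple lookup from tier to control. The sketch below is illustrative only: it copies three of the nine rows from the table above, and the names `CROSSWALK` and `controls_for` are hypothetical rather than part of any toolkit.

```python
TIERS = ("R1", "R2", "R3", "R4")

# Three rows copied from the cross-walk table; the other six follow the same shape.
CROSSWALK = {
    "Fairness": (
        "Bias awareness",
        "Bias screen",
        "Sub-group testing",
        "External fairness audit",
    ),
    "Technical robustness": (
        "Offline eval",
        "Calibration + drift",
        "Formal validation + canary",
        "Independent test campaign",
    ),
    "Security": (
        "Baseline cyber hygiene",
        "Threat model + secrets mgmt",
        "Red-team + prompt-injection tests",
        "Independent pen-test + continuous monitoring",
    ),
}


def controls_for(tier: str) -> dict[str, str]:
    """Return the proportionate control for each category at one tier."""
    if tier not in TIERS:
        raise ValueError(f"unknown tier: {tier!r}")
    column = TIERS.index(tier)
    return {category: row[column] for category, row in CROSSWALK.items()}
```

Calling `controls_for("R3")` returns one control per category, which a team could paste into the "Controls applied" field of the risk-tier classification template.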

Service Lifecycle & Phase Gates

A pilot is not a beta. A beta is not live. Each phase asks a different assurance question and demands different evidence. This view encodes the Level 3 lifecycle, the AI Assurance Questionnaire stages and the Service Standard expectations into one progression.

Phase 1 Pilot — is this worth pursuing?

Primary assurance question

Is the AI use case valid? Can we identify the risk?

AI-specific focus

User need; feasibility; risk discovery; non-AI alternatives appraised.

Expected evidence

Use-case rationale (problem → AI → outcome)
Prototype results; failure-category log
Initial risk register & data assessment
Comparison with rules-based / manual alternative

Assessor-style questions

What user need does the AI capability serve?
What non-AI options did you consider?
What risky assumptions did the pilot test?
What would cause you to stop or redesign the AI use case?
Who owns the AI risk at this phase?

Decision

Proceed to managed beta preparation · Pivot · Stop

Phase 2 Managed Beta — safe under controlled conditions?

Primary assurance question

Can the service work safely with limited users under controlled conditions?

AI-specific focus

Human control; monitoring; security & privacy; performance; accessibility; fallback.

Expected evidence

Evaluation results against representative scenarios
DPIA, threat model, model card
Test logs; support model; rollback plan
Accessibility & fairness testing pre-launch

Assessor-style questions

How do users know when AI is involved?
Where does meaningful human control happen?
What happens when the AI output is wrong?
How are prompt injection, adversarial inputs or misuse handled?
What is the rollback plan?

Decision

Proceed to private/public beta · Remediate · Pause

Phase 3 Beta — safe with real users?

Primary assurance question

Can the service work safely with real users, real data and real constraints?

AI-specific focus

Live drift monitoring; supplier risk; sub-group performance; incident response readiness.

Expected evidence

Pre-production performance dashboard
External-AI integration test results & versioning controls
Rollback rehearsal log; incident runbook
Differential-impact monitoring plan

Decision

Proceed to live readiness · Remediate · Pause

Phase 4 Live — safely operating at scale?

Primary assurance question

Can the service operate safely, reliably and continuously at scale — including when the AI degrades, fails or changes?

AI-specific focus

Drift monitoring; incident response; model updates; operational ownership; continuous assurance.

Expected evidence

Runbooks & on-call escalation routes
Live dashboards: accuracy, fairness, hallucination, cost
Audit logs; retraining / update controls
Supplier-failure fallback; sustainability evidence
Quarterly re-certification record (R3/R4)

Assessor-style questions

Who can pause, disable or roll back the AI component?
How are model updates approved?
What happens if the AI supplier fails?
Is AI still the most cost-effective way to meet the user need?
What evidence will be maintained for reassessment or amber review?

Decision

Go live · Limited live · Delay · Reassess
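
One way to keep phases from being collapsed is to treat the gates as a small state machine. The following is a minimal sketch, not a prescribed tool: the decision labels are copied from the four phases above, while `next_phase` and the rule that only a "Proceed" decision advances the service are assumptions made for illustration.

```python
PHASES = ["Pilot", "Managed Beta", "Beta", "Live"]

# Decisions available at each gate, copied from the phase descriptions.
GATE_DECISIONS = {
    "Pilot": {"Proceed to managed beta preparation", "Pivot", "Stop"},
    "Managed Beta": {"Proceed to private/public beta", "Remediate", "Pause"},
    "Beta": {"Proceed to live readiness", "Remediate", "Pause"},
    "Live": {"Go live", "Limited live", "Delay", "Reassess"},
}


def next_phase(current: str, decision: str) -> str:
    """Return the phase a service is in after a gate decision.

    Assumption: only a "Proceed ..." decision moves the service forward,
    and it moves forward by exactly one phase. Every other decision
    leaves the service where it is.
    """
    if decision not in GATE_DECISIONS[current]:
        raise ValueError(f"{decision!r} is not a valid decision at the {current} gate")
    if decision.startswith("Proceed"):
        return PHASES[PHASES.index(current) + 1]
    return current
```

Because the Live gate has no "Proceed" decision, the function can never step past the last phase, and a pilot cannot jump straight to "Go live".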

E2VT & Service Standard alignment

Trust cannot be asserted. It has to be evidenced. The E2VT loop — Evaluate, Evidence, Validate, Trust — is the operational discipline applied at every phase gate. It maps onto the 14 Service Standard points so that AI-specific evidence travels through the same assessment route as any other digital service.

Evaluate

Test models rigorously against defined criteria. Are outcomes good enough? Is real-world impact assessed?

Evidence

Ground every assurance claim in evidence, not assumption. Controls and requirements demonstrably met.

Validate

The system meets user, policy and regulatory needs. Are we solving the right problem in the right way?

Trust

End-to-end workflows defined; monitoring & incident response live; transparency artefacts published.

14 Service Standard points — AI-enabled evidence

Standard point	AI-enabled service evidence	Typical owner
1. Understand users & needs	Research showing AI solves a real user need (not a tech preference)	User research lead
2. Solve a whole problem	End-to-end journey showing where the AI boundary starts and stops	Product / design lead
3. Joined-up channels	Assisted-digital, offline and non-AI fallback journeys defined	Service designer
4. Simple to use	Explanations, confidence language, user-facing control over AI outputs	Interaction / content lead
5. Everyone can use it	Accessibility, inclusion and bias testing covering non-standard speech, dialects, disability, low digital confidence	Design / research lead
6. Multidisciplinary team	Named AI, data, security, policy and service owners	Service owner
7. Agile ways of working	Iteration log, AI learning loop, decision records	Delivery manager
8. Iterate and improve	Evaluation cycles, controlled model-update process	Product / AI lead
9. Secure and private	DPIA, threat model, data minimisation, prompt-injection testing	Tech / security lead
10. Define success	Service KPIs plus AI accuracy, fairness and safety metrics	Performance analyst
11. Right tools and tech	AI option appraisal, model-selection rationale, exit plan	Technical architect
12. Open source	Code, prompts, configs, reusable patterns where appropriate	Tech lead
13. Open standards / components	Reuse of common platforms, open standards, shared patterns	Tech / design lead
14. Reliable service	Runbooks, monitoring, fallback, rollback, incident process	Service / ops lead

Governance Structure & RACI

Governance must be operational, not ceremonial. Each artefact has a single accountable person, a working responsible team, and clearly defined consulted and informed parties. The model layers a strategic AI Governance Board over departmental assurance boards and product-level day-to-day controls.

Three-tier governance

AI Governance Board (strategic)

Sets risk appetite, approves R-tier policy, owns ATRS publication, commissions external audits. Chaired by the senior accountable officer, with the SIRO and DPO as members.

Assurance Board (departmental)

Approves go-live, holds phase-gate decisions, owns evaluation standards and shared platforms, runs assurance clinics (Level 1–3).

Product / Service team (operational)

Executes E2VT, maintains evidence pack, runs monitoring, manages incidents. Day-to-day accountable owner is the Service / Product Owner.

Operating RACI

Area	Accountable (A)	Responsible (R)	Consulted (C)	Informed (I)
Risk Tiering	Service / Product Owner	Principal Technologist, Safety Lead	Legal, DPO	Team
Evaluation Plan	Principal Technologist	ML Eng, Data Scientist, QA	Domain SMEs, UX Researchers	AI Governance Board
Data Provenance	AI / Data Lead	ML Eng, Data Scientist	Domain SMEs, UX Researchers	AI Governance Board
Red Team	Safety Lead	Security, Red-Teamers	Product Owner, Legal	All
DPIA / EqIA	DPO	Product Owner, Legal	SIRO, User Research	Assurance Board
Go-Live decision	SRO	Service / Product Owner	Assurance Board	Stakeholders
Monitoring & Incidents	Service / Product Owner	SRE / On-call	Safety Lead, Comms	All
Re-certification	SRO	Principal Technologist	Assurance Board	AI Governance Board
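
The "single accountable person" rule can be checked mechanically. This is a hedged sketch: the Accountable column is copied from the table above, but `accountability_gaps` and the convention that a comma signals more than one owner are assumptions for illustration.

```python
# Accountable (A) column of the operating RACI, copied from the table.
RACI_ACCOUNTABLE = {
    "Risk Tiering": "Service / Product Owner",
    "Evaluation Plan": "Principal Technologist",
    "Data Provenance": "AI / Data Lead",
    "Red Team": "Safety Lead",
    "DPIA / EqIA": "DPO",
    "Go-Live decision": "SRO",
    "Monitoring & Incidents": "Service / Product Owner",
    "Re-certification": "SRO",
}


def accountability_gaps(accountable_by_area: dict[str, str]) -> list[str]:
    """List areas that break the one-accountable-owner rule.

    Assumption: a comma in the cell means several owners were entered.
    """
    gaps = []
    for area, owner in accountable_by_area.items():
        if not owner.strip():
            gaps.append(f"{area}: no accountable owner")
        elif "," in owner:
            gaps.append(f"{area}: more than one accountable owner")
    return gaps
```

An empty result means every area has exactly one accountable entry; the same check could be run over a service-level RACI worksheet before a phase gate.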

Towards observable maturity

The target operating model is supported by four operational dashboards/registries — each is a capability that should exist at department or cross-government level:

Generic AI Dashboard

Service-level view of uptake, accuracy, cost, incident count.

Model Registry

Versioning, provenance, owner, R-tier, ATRS link, exit plan.

Hallucination Detection

Output quality monitoring with sub-group breakdowns.

Incident Management

Severity, escalation, comms, post-incident learning loop.

AI-enabled Failure-Mode Matrix

The bigger assessment risk is rarely whether the model works. It is whether the team can show why AI is appropriate, how harms are detected, how humans remain in meaningful control, and how the service operates reliably in live conditions. Stress-test these failure modes before assessment.

Failure mode	Pilot intervention	Managed beta intervention	Beta / live intervention	Assessment risk (points)
AI not justified by user need	Compare AI vs non-AI options	Validate with controlled users	Reassess VfM & user impact	1, 2, 11
Model output wrong or misleading	Identify failure categories	Test against representative scenarios	Monitor accuracy & incidents	9, 10, 14
Bias or exclusion	Identify protected-characteristic risks	Accessibility & fairness testing	Monitor differential impact	5, 9, 10
Prompt injection / adversarial misuse	Threat-model attack paths	Test adversarial prompts & abuse	Monitor abuse, patch controls	9, 14
Weak human control	Define human role & handoffs	Test decision handoff & override	Audit human review & escalation	4, 6, 9
Model drift	Define baseline	Monitor pre-live quality changes	Drift alerts & retraining governance	8, 10, 14
Supplier / model unavailable	Identify dependency	Test fallback	Operate fallback & incident route	11, 14
Governance unclear	Name accountable owner	Confirm governance gates	Maintain live decision log	6, 8, 14
Evidence scattered	Build evidence map	Rehearse assessor narrative	Maintain evidence pack	All 14
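
To see which Service Standard points are most exposed, the final column of the matrix can be tallied. The sketch below copies that column as written; `exposure_by_point` is a hypothetical helper, and treating every failure mode as equally weighted is a simplifying assumption.

```python
from collections import Counter

# "Assessment risk (points)" column, copied from the matrix above.
FAILURE_MODE_POINTS = {
    "AI not justified by user need": [1, 2, 11],
    "Model output wrong or misleading": [9, 10, 14],
    "Bias or exclusion": [5, 9, 10],
    "Prompt injection / adversarial misuse": [9, 14],
    "Weak human control": [4, 6, 9],
    "Model drift": [8, 10, 14],
    "Supplier / model unavailable": [11, 14],
    "Governance unclear": [6, 8, 14],
    "Evidence scattered": list(range(1, 15)),  # "All 14"
}


def exposure_by_point(matrix: dict[str, list[int]]) -> Counter:
    """Count how many failure modes put each Standard point at risk."""
    counts = Counter()
    for points in matrix.values():
        counts.update(points)
    return counts
```

On this data, point 14 (reliable service) is touched by six failure modes and point 9 (secure and private) by five, which suggests where rehearsal time is best spent.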

Answer-quality scoring (for reference)

Score	Meaning	Example
1	Assertion only	“We have tested this.”
2	Some evidence	“We ran testing and have results.”
3	Evidence linked to risk	“Testing showed these risks and these controls are in place.”
4	Evidence + control + ownership + learning	“Testing showed X; control Y is owned by Z; we revise on amber review.”
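
The scale can be read as a ladder in which each score adds an ingredient. The following is one possible encoding, offered as an interpretation rather than an official rubric; `answer_score` and its boolean parameters are hypothetical.

```python
def answer_score(
    has_evidence: bool,
    linked_to_risk: bool,
    has_control: bool,
    has_owner: bool,
    has_learning: bool,
) -> int:
    """Map the ingredients of an answer onto the 1-4 scale above."""
    if not has_evidence:
        return 1  # assertion only
    if not linked_to_risk:
        return 2  # some evidence
    if has_control and has_owner and has_learning:
        return 4  # evidence + control + ownership + learning
    return 3  # evidence linked to risk


# "We ran testing and have results." -> evidence, nothing else
print(answer_score(True, False, False, False, False))
```

In practice an assessor's judgement decides whether each ingredient is present; the function only makes the ordering explicit.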

Operating Templates

Eight reusable templates that turn the operating model into daily artefacts. Together they form the minimum evidence pack any AI-enabled service should hold.

T1 · One-page Service Story

Pilot → Live

Service name

Service owner (SRO)

What problem is the service solving?

Who are the users?

What user need does the AI support?

Decision / recommendation / action the AI influences

Is AI used to build, or part of the output?

Non-AI fallback

Harm if AI output is wrong

Risk owner

T2 · Risk-tier classification

All phases

Use case

Proposed tier

Dominant risk category

Vulnerable cohorts affected?

Justification for tier

Controls applied

Sign-off (name & role)

Re-certification date

T3 · Evidence Tracker (14 Standard points)

Beta → Live

Standard area	Claim	Evidence location	Owner	RAG

T4 · Failure-mode card

Pilot → Live

Failure mode title

Plausible scenario

How would this be detected?

User-facing safeguard

Human review / appeal route

Incident owner

Evidence shown to assessor

Likelihood × Impact (1–5)

T5 · Model Card

Managed beta → Live

Model name & version

Supplier / origin

Intended use

Out-of-scope uses

Training / fine-tuning data summary

Evaluation metrics & results

Sub-group performance

Known limitations & failure modes

Update / retrain process

ATRS reference

T6 · Phase-gate decision log

Every gate

Phase

Decision date

Decision

Decision maker (SRO)

Evidence reviewed

Outstanding risks & owners

Conditions of proceed

Next review date

T7 · AI Incident Report

Live

Incident ID

Severity

What happened?

How was it detected?

Affected users / cohorts

Containment action

Was kill-switch used?

Root cause

Notifications (SIRO, DPO, ICO, Minister)

Lessons fed back to operating model

T8 · Service RACI worksheet

Setup

Activity	A	R	C	I

Centrally Endorsed Guidance for Responsible Deployment

Departmental approaches to AI assurance are inconsistent today — some teams over-engineer controls, others under-control. This section sets a single, written and centrally endorsed position covering responsible deployment, model-update governance, accuracy expectations and acceptable risk thresholds. It is designed to unblock safe experimentation by making the rules explicit, so teams stop guessing and risk-aversion stops acting as a default veto.

“A consistent, written deployment standard means innovation can travel safely across government. Where the rules are clear, departments can move faster — not slower.” — Reference Operating Model, central guidance principle

G1 · Responsible deployment standard

Every AI-enabled service deployed in the UK public sector must satisfy the following seven mandatory commitments (D1–D7), regardless of risk tier or department. Anything below this floor is not a deployment: it is an experiment and must remain inside a controlled environment.

D1 · Named SRO & risk tier

A single accountable owner and an explicit R1–R4 classification recorded against the service. Aligns: Risk Toolkit RACI; AAQ governance stage.

D2 · DPIA / EqIA completed

Privacy and equality impact assessments signed off before any real user touches the service. Aligns: AAQ Ethics; Risk Toolkit categories Legal & Fairness.

D3 · ATRS entry published (R2+)

Algorithmic Transparency Recording Standard entry live before public exposure for R2 and above. Aligns: Risk Toolkit Transparency category.

D4 · Meaningful human control

Override, appeal, fallback and kill-switch routes defined, tested and owned. Aligns: AAQ governance; Service Standard pt 4.

D5 · Monitoring & drift plan

Live dashboards for accuracy, fairness, hallucination and cost, with thresholds defined before launch. Aligns: Toolkit Technical Robustness; AAQ Lifecycle.

D6 · Inclusion evidence

Accessibility and sub-group testing covering dialects, accents, low digital confidence and disability. Aligns: Research Assessment; Service Standard pt 5.

D7 · Evidence pack & decision log

Single discoverable evidence pack and phase-gate decision log; not held in one person’s head. Aligns: Level 3 Evidence Tracker; AAQ v4.3.

G2 · Model update governance

Model behaviour changes either intentionally (prompt or configuration edit, retrain or fine-tune, version bump, knowledge-base update) or externally (supplier model swap or supplier-side fine-tuning). All five update types in the table below trigger the same governance path; the depth of that path is set by R-tier.

Update type	Trigger	R1	R2	R3	R4
Prompt / configuration change	Team-initiated	Peer review & log	Peer review + regression suite	Change Advisory Board + canary	CAB + change-freeze respected + canary + rollback rehearsal
Retrain / fine-tune	Drift, new data or scheduled	Baseline diff	Eval pack rerun	Full E2VT rerun + SME review	Independent re-validation + external audit hook
Version bump (own model)	Release process	Semver + changelog	Eval rerun + SLO check	Canary 5 → 25 → 100% with kill-switch	Quarterly re-certification supersedes
Supplier model change	Provider-initiated	Notify owner	Regression + bias delta	Pause flow + revalidate + ATRS update	Halt service until revalidated & signed off by SRO
Knowledge base / RAG update	Content team	Quality spot-check	Hallucination diff vs baseline	SME panel + factuality eval	SME + independent eval, with citation audit

Hard rule for all tiers: no silent updates. Every model change leaves a record in the Model Registry, the decision log and (R2+) the ATRS entry. Where the change is provider-initiated, the service operates under a presumption to pause until revalidation completes.
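
The "no silent updates" rule and the supplier-change row can be expressed as two small lookups. This is a sketch under stated assumptions: the responses are copied from the table above, while `SUPPLIER_CHANGE_RESPONSE` and `records_required` are hypothetical names, not an existing registry API.

```python
# Supplier model change row, copied from the G2 table.
SUPPLIER_CHANGE_RESPONSE = {
    "R1": "Notify owner",
    "R2": "Regression + bias delta",
    "R3": "Pause flow + revalidate + ATRS update",
    "R4": "Halt service until revalidated & signed off by SRO",
}


def records_required(tier: str) -> list[str]:
    """Return where a model change must be recorded for a given tier.

    Every tier records in the Model Registry and the decision log;
    R2 and above also update the ATRS entry.
    """
    if tier not in SUPPLIER_CHANGE_RESPONSE:
        raise ValueError(f"unknown tier: {tier!r}")
    records = ["Model Registry", "decision log"]
    if tier != "R1":
        records.append("ATRS entry")
    return records
```

A change pipeline could refuse to close a change ticket until every destination returned by `records_required` has been updated.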

G3 · Accuracy expectations

Accuracy is not a single number — and "high accuracy" is not a control. The central position is that each service must declare four metrics with explicit thresholds, set before launch and proportionate to tier. Below the lower threshold the service must pause; in the amber band it must remediate; above the green threshold it may operate.

Metric family	What it measures	R1 floor	R2 floor	R3 floor	R4 floor
Task accuracy	Correct outputs on golden / held-out set	≥ 70%	≥ 85%	≥ 92%	≥ 95% with CI reported
Sub-group parity	Max gap in accuracy across protected groups	≤ 15 pp	≤ 10 pp	≤ 5 pp	≤ 3 pp with rationale
Hallucination / factuality	Unsupported claims on representative prompts	≤ 10%	≤ 5%	≤ 2%	≤ 1% with citation audit
Refusal & safety	Correct refusal of unsafe / out-of-scope queries	≥ 80%	≥ 90%	≥ 95%	≥ 98% with red-team set

Thresholds are reference defaults. Departments may set stricter values for the same tier; they may not relax them without AI Governance Board approval. pp = percentage points.
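
A gate check against these floors might look like the sketch below. The numeric floors are copied from the table; the metric keys and the function `floor_breaches` are hypothetical. The table gives floors only, so the sketch reports breaches and does not attempt to model an amber band.

```python
# (direction, floor per tier). Values are percentages, or percentage
# points for the sub-group gap, copied from the G3 table.
FLOORS = {
    "task_accuracy": (">=", {"R1": 70, "R2": 85, "R3": 92, "R4": 95}),
    "subgroup_gap_pp": ("<=", {"R1": 15, "R2": 10, "R3": 5, "R4": 3}),
    "hallucination_rate": ("<=", {"R1": 10, "R2": 5, "R3": 2, "R4": 1}),
    "safe_refusal_rate": (">=", {"R1": 80, "R2": 90, "R3": 95, "R4": 98}),
}


def floor_breaches(tier: str, measured: dict[str, float]) -> list[str]:
    """Return the metrics that fall short of the reference floor for a tier.

    `measured` must contain all four metric keys used in FLOORS.
    """
    failed = []
    for metric, (direction, floors) in FLOORS.items():
        value = measured[metric]
        floor = floors[tier]
        ok = value >= floor if direction == ">=" else value <= floor
        if not ok:
            failed.append(metric)
    return failed
```

For example, an R3 service measuring 93% task accuracy, a 6 pp sub-group gap, 1.5% hallucination and 96% safe refusal breaches only the sub-group parity floor. A department with stricter local values would swap in its own numbers.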

G4 · Acceptable risk thresholds

Thresholds are mapped to the Orange Book-aligned appetite scale (Averse → Eager) used in the AI Risk Management Toolkit and set against the nine risk categories.

Risk category	Central position	Floor (will not accept below)	Trigger for pause
Financial	Open	Project not solvent without AI subsidy	Run-rate > 1.5× business case
Legal & regulatory compliance	Minimal	Any unresolved legal challenge with material likelihood	DPIA / EqIA finding above ‘low’ not remediated
Appropriate transparency & explainability	Cautious	Users cannot tell they are using AI	ATRS entry missing / out of date at R2+
Fairness	Minimal	Sub-group parity gap > G3 floor	Live monitoring shows widening gap over 2 cycles
Accountability & governance	Cautious	No named SRO; no decision log	SRO change without handover within 10 working days
Contestability & redress	Cautious	No redress channel defined	Redress SLA breach rate > 5%
Technical robustness	Open	Below G3 task-accuracy floor	Drift alert sustained for > 1 cycle without action
Security	Averse	Unmitigated prompt-injection or data-exfil risk	Any sev-1 incident; CVE on the model path
People & the environment	Minimal	Foreseeable harm to vulnerable cohort without safeguard	Safeguarding incident, or sustainability budget breach
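
Several of the pause triggers are numeric and can be evaluated automatically. The sketch below covers three of them. The thresholds (1.5× business case, 5% SLA breach rate) are from the table; the function name, its parameters, and the reading of "widening gap over 2 cycles" as two consecutive increases are assumptions.

```python
def pause_triggers(
    run_rate: float,
    business_case: float,
    redress_breach_rate: float,
    parity_gaps_pp: list[float],
) -> list[str]:
    """Return the risk categories whose pause trigger has fired.

    parity_gaps_pp holds the sub-group parity gap per monitoring cycle,
    oldest first. redress_breach_rate is a fraction (0.05 means 5%).
    """
    fired = []
    if run_rate > 1.5 * business_case:
        fired.append("Financial")
    if redress_breach_rate > 0.05:
        fired.append("Contestability & redress")
    # Assumption: "widening over 2 cycles" means the gap grew in each of
    # the last two cycles, which needs three readings.
    if len(parity_gaps_pp) >= 3:
        a, b, c = parity_gaps_pp[-3:]
        if a < b < c:
            fired.append("Fairness")
    return fired
```

Qualitative triggers, such as a safeguarding incident or an SRO change without handover, still need a human to raise them.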

G5 · Pre-approved deployment routes

To remove the most common cause of departmental risk-aversion — uncertainty about what is allowed — the following pre-approved patterns may be deployed at R1/R2 using the reference controls only, without bespoke board review:

Pattern A — Internal staff productivity assistant (drafting, summarisation, search) on non-personal corporate data. Requires D1, D5, D7.
Pattern B — Internal knowledge retrieval (RAG) over departmental knowledge base, with citation. Requires D1, D2, D5, D7.
Pattern C — User-facing informational assistant with explicit "AI-generated" labelling and human-reviewed corpus. Requires D1–D7.
Pattern D — Caseworker triage / prioritisation (advisory, not decisional). Requires D1–D7 and sub-group parity evidence (G3).

Anything outside the pre-approved patterns, or at R3/R4, requires departmental Assurance Board sign-off plus AI Governance Board notification.
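
The routing rule for pre-approved patterns can be sketched as a set comparison. The pattern requirements are copied from the list above; `deployment_route` and its return strings are hypothetical and only illustrate the decision.

```python
ALL_COMMITMENTS = {f"D{i}" for i in range(1, 8)}  # D1 to D7

# Commitments required per pre-approved pattern.
PATTERN_REQUIREMENTS = {
    "A": {"D1", "D5", "D7"},
    "B": {"D1", "D2", "D5", "D7"},
    "C": ALL_COMMITMENTS,
    "D": ALL_COMMITMENTS,  # plus sub-group parity evidence (G3)
}


def deployment_route(
    pattern: str,
    tier: str,
    commitments_met: set[str],
    parity_evidence: bool = False,
) -> str:
    """Decide whether a use case can use a pre-approved route."""
    if pattern not in PATTERN_REQUIREMENTS or tier not in ("R1", "R2"):
        return "board sign-off"
    missing = PATTERN_REQUIREMENTS[pattern] - set(commitments_met)
    if missing:
        return "blocked: missing " + ", ".join(sorted(missing))
    if pattern == "D" and not parity_evidence:
        return "blocked: missing sub-group parity evidence"
    return "pre-approved"
```

Here "board sign-off" stands for departmental Assurance Board sign-off plus AI Governance Board notification.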

G6 · Consistency across government

To prevent departmental drift, the following are common across all departments and may not be re-defined locally:

Risk tiers R1–R4

Single definition used across HMG. Departments may add sub-tiers, not redefine the four.

Nine risk categories

AI Risk Management Toolkit D02 v1 categories used verbatim.

Seven deployment commitments (D1–D7)

Floor for any live deployment; departments may add, not subtract.

G3 accuracy floors

Reference minima per tier; stricter allowed, looser requires AI Gov Board approval.

Model-update governance (G2)

Same five triggers, same four tier responses across departments.

Security baseline

Appetite = Averse for security; common minimum across HMG.

Fairness floors

Sub-group parity gap thresholds in G3 are common minima.

Incident notification

Sev-1 incidents notified to SIRO, DPO and AI Gov Board within 24 hours.
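
The 24-hour notification rule is easy to encode for an incident tool. This is a minimal sketch: the 24-hour window and the three parties come from the item above, while the function names and the choice to start the clock at detection time are assumptions, since the guidance does not say when the clock starts.

```python
from datetime import datetime, timedelta, timezone

NOTIFY_WITHIN = timedelta(hours=24)
NOTIFY_PARTIES = ("SIRO", "DPO", "AI Governance Board")


def notification_deadline(detected_at: datetime) -> datetime:
    """Latest time by which a sev-1 incident must be notified.

    Assumption: the 24 hours run from the moment of detection.
    """
    return detected_at + NOTIFY_WITHIN


def is_overdue(detected_at: datetime, now: datetime) -> bool:
    """True once the notification window has passed."""
    return now > notification_deadline(detected_at)


detected = datetime(2025, 1, 6, 9, 0, tzinfo=timezone.utc)
print(notification_deadline(detected).isoformat())
```

Using timezone-aware datetimes avoids ambiguity when the on-call team and the board work across clock changes.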

G7 · How this navigates risk aversion

The most common failure pattern observed in departmental practice is not deploying badly — it is not deploying at all, because no one is sure what "good enough" looks like. This guidance addresses each pattern:

Symptom of risk aversion	Central guidance response
“We don’t know what controls are required.”	D1–D7 deployment commitments + Tier×Category cross-walk are written and endorsed.
“What accuracy is good enough?”	G3 declares default floors per tier; departments may set stricter, not looser.
“How do we change the model safely?”	G2 sets the five update triggers and tiered response.
“What if the supplier changes the model?”	G2 supplier-change row + ‘presumption to pause’.
“Do we need board sign-off for every use case?”	No — G5 pre-approved patterns are deployable without bespoke review at R1/R2.
“Our department does it differently to others.”	G6 fixes the non-negotiable common ground; local addition allowed, redefinition not.
“How do we know when to stop?”	G4 pause triggers + G3 amber/red thresholds are explicit, not judgemental.

G8 · Where this guidance lives in the operating model

Strategic layer: G1, G4 and G6 sit with the AI Governance Board (L1).
Tactical layer: G2, G3 and G5 are owned by the departmental Assurance Board (L2) and implemented through shared evaluation standards and the Model Registry.
Operational layer: D1–D7 are enforced at phase gates by the Service / Product Owner (L3) using the Operating Templates (T1–T8).
Assurance layer: G4 pause triggers feed directly into the Failure-Mode Matrix and the AAQ v4.3 evidence questions.

Framework Synergy & Critical Alignment

The five source frameworks each address a slice of the AI assurance problem. The reference operating model integrates them so a single piece of evidence satisfies multiple frameworks simultaneously — reducing duplication and exposing genuine gaps.

Strategic AI Governance

Sets ambition & risk appetite

10 principles, R1–R4 tiering, leadership choice, target operating model components.

Feeds → Operating Model L1 (Strategic), Risk Tiering, all phase gates.

Level 3 Intervention

Service-assessment readiness

Lifecycle integrity (pilot ≠ beta ≠ live), failure-mode drill, assessor-role play, evidence challenge against the 14 points.

Feeds → Service Lifecycle, Failure Modes, Templates T1/T3/T4/T6.

AI Assurance Questionnaire v4.3

146 evidenced questions

Question bank mapped to project governance stage, lifecycle stage, risk tier and ethical dimension.

Feeds → E2VT evaluation step; phase-gate questions; Evidence Tracker T3.

AI Risk Management Toolkit

Risk identification & control

Nine risk categories (D02 v1), Orange Book-aligned appetite scale, RACI, control taxonomy and mitigation patterns.

Feeds → Risk Categories, Governance RACI, Templates T2/T7.

Research Assessment

AI in user research

Hallucination checks on research outputs, synthetic-user validation, dialect & accent coverage.

Feeds → Standard points 1, 5; failure modes (bias, exclusion); evidence of inclusion.

Cross-framework synergy matrix

Where the same operating-model area is informed by multiple frameworks, alignment is genuine. Gaps below are where the reference operating model adds bridging guidance.

OPM area	Strategic Gov	Level 3	AAQ v4.3	Risk Toolkit	Research
Risk appetite / tiering	● primary	○	○	●	–
Service lifecycle gates	○	● primary	●	○	○
Evidence & questioning	○	●	● primary	○	○
Failure modes & controls	○	●	○	● primary	○
User research integrity	–	○	○	–	● primary
Operating dashboards	● primary	–	–	○	–
RACI & accountability	●	○	○	● primary	–

● primary contributor · ○ reinforcing contributor · – not in scope

Critical observations

Synergy: Risk tiering (Strategic Gov) and failure-mode controls (Risk Toolkit) align cleanly — tier sets depth, toolkit sets shape. Use Templates T2 and T4 together.
Synergy: Level 3 lifecycle phasing and AAQ v4.3 lifecycle stages are isomorphic, so AAQ questions plug directly into the assessor-style questions at each phase gate in Service Lifecycle & Phase Gates.
Gap bridged: Research Assessment scrutinises hallucinated research outputs but is silent on live operation. The OPM routes its findings into Standard points 1 and 5, then into live drift monitoring.
Gap bridged: Strategic Gov describes target dashboards (registry, hallucination, incident) but not who runs them. The Governance RACI assigns L2 Assurance Board ownership.
Risk to mitigate: All five frameworks treat “evidence” differently. The Evidence Tracker (T3) consolidates to a single pack per service to prevent scatter — the most cited Level 3 failure.

“The hard part is not access to AI. The hard part is operating model maturity. Without clear governance, assurance and accountability, AI adoption fragments.” — Strategic AI Governance at Scale
