Validation of artificial intelligence systems

Risk advisory

By: Daniel Fernández Domínguez, Lukas Majer, Dwayne Price, Jonathan Fitzpatrick, Juan García

03 Jun 2026 15 min read

Validating AI systems becomes essential as generative and agentic AI models are integrated into critical processes. These technologies already influence decision-making, workflow automation and the generation of analyses, which makes it necessary to strengthen control, traceability and compliance in order to manage risk properly.

Introduction

Artificial intelligence is transforming the way organizations operate, reshaping processes, accelerating analysis, and enabling new ways of generating insights. As AI systems move from experimentation to core infrastructure, institutions must establish robust risk management frameworks that keep model performance under effective oversight through validation, monitoring, and accountability. Existing model risk practices must evolve to address the behavioral and operational characteristics of both generative and agentic AI models, while maintaining trust and regulatory compliance to support sound decision-making.

Regulatory expectations are converging across EU and UK jurisdictions, requiring AI systems to meet standards of explainability, traceability, governance, and human oversight. In this article we propose a model risk framework designed to meet these expectations by integrating behavioral assurance, auditability, and control mechanisms, enabling institutions to scale AI models securely and with confidence.

The framework presented here follows a structured approach to comprehensive validation that encompasses data quality and security, behavioral testing, outcome evaluation, human-in-the-loop review, continuous monitoring, and remediation mechanisms.

The new generation of AI systems

Machine learning models have traditionally focused on statistical prediction: estimating probabilities, classifying results or detecting patterns in structured data. These models typically produce numerical results and are primarily evaluated using quantitative performance metrics. However, modern AI systems go beyond predictive modeling. They combine large language models (LLMs) with capabilities such as reasoning, information retrieval, planning, and interaction with external tools.

As a result, these systems behave more like analytical assistants than traditional statistical models. For the purposes of this article, we focus on these AI systems, which typically consist of two key components:

Generative AI: Systems used to produce reasoning and narrative explanation.
Agentic AI: Systems that can pursue goals, make decisions over several steps, use tools and act autonomously.

Generative AI represents the foundational layer of modern AI capabilities. These models stand out for producing coherent and high-quality narratives: summarizing documents, reorganizing evidence, rewriting notes, and structuring complex behaviors into clear explanations.

Their strengths lie in interpretation and documentation, not in tasks such as performing calculations or inferring missing data. They improve human efficiency, but they are not a substitute for human judgment.

Agentic AI extends generative AI by introducing the ability to design structured plans, identify tools, and execute sequential tasks with limited human intervention. This introduces additional complexities from a model risk management perspective, including setting boundaries, governing tools, and checking the accuracy of results.

Thus, while generative AI and agentic AI are distinct types of systems, for validation purposes generative AI can be treated as the zero-autonomy baseline to which decision-making and agentic execution capabilities are added.

Several components of agentic systems commonly appear in financial services workflows, each entailing distinct behaviors and risks that need to be validated. These include:

Planning agents that break down tasks into structured steps and sequence actions.
Retrieval agents that locate information from documents or databases.
Tool-using agents that interact with calculators, APIs, or internal systems to perform actions.
Orchestration agents that decide which tools or workflows to run.
Verification agents that review outputs and reasoning.

In combination, these agents can execute multi-step processes from start to finish, explicitly carrying context, intermediate outputs, and decisions from one step to the next, so that the overall workflow remains consistent and traceable.

The risks associated with AI

Modern AI systems introduce a series of risks that depend on how they generate results and, in some cases, how they are designed to act autonomously. To facilitate the assessment and control of these risks, it is useful to group them into categories that reflect where failures are most likely to arise. The types of risk described below provide a practical way to structure the validation framework and ensure that testing remains proportionate to how the system is built, how it behaves, and how it is used.

Design and Implementation Risk

Design and implementation risk describes situations where weaknesses in the way the system is built or configured manifest as unsafe or unintended behaviors at runtime. This includes improper autonomy configurations, faulty workflow design, inadequate tooling, flaws in architecture design, or improper configuration of prompts and guardrails, all of which can lead to results that fall outside intended behavior or are not in accordance with policies.

Core risk

Core risks are common to both generative AI models and agents and reflect fundamental failure modes inherent in data-driven probabilistic systems. These risks arise regardless of system autonomy or tool use and therefore constitute the basic risk layer applicable to all AI deployments.

The main risk categories typically include:

Factual integrity: Risk of unsubstantiated, unverifiable, false or fabricated statements.
Reasoning integrity: Risk of causal gaps, faulty logic, missing steps, or incoherent reasoning.
Consistency: Risk of contradictory or internally inconsistent results between responses or executions.
Stability and drift: Risk that behavior changes across executions, model updates, or small input variations.
User overconfidence (governance): Risk that a fluent narrative leads users to trust AI results without adequate oversight.

Agent-specific risk

Agentic AI introduces additional risks due to its workflow-oriented and goal-oriented nature and its ability to act with greater autonomy. Unlike purely generative systems, these risks arise from the system's ability to plan, make intermediate decisions, invoke tools, and execute actions with limited human intervention.

As a result, agentic AI gives rise to additional risk categories, including:

Planning integrity: Risk of invented, irrelevant or unsafe steps in the plans generated.
Workflow consistency: Risk of incorrect sequencing, dependency errors, or flawed step logic.
Tool-use safety: Risk of unsafe or incorrect selection of tools or APIs, incorrect parameters, or misuse.
State integrity: Risk of corrupted or contaminated intermediate states across steps.
Retrieval integrity: Risk of erroneous source selection, incorrect grounding, or unstable retrieval behavior.
Auditability and traceability: Risk that plans, reasoning, or interactions with tools cannot be reproduced or traced.
Protection and autonomy: Risk that the agent exceeds the permitted autonomy, circumvents restrictions or performs unsafe actions.
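Several of these agent-specific risks can be checked mechanically against a recorded execution trace. The following is a minimal sketch, assuming a hypothetical trace format (a list of steps, each naming the tool it invoked), a hypothetical tool allowlist and an illustrative step limit; none of these names come from a specific agent framework.

```python
def audit_trace(trace: list[dict], allowed_tools: set[str], max_steps: int) -> list[str]:
    """Return human-readable violations found in one agent execution trace.

    `trace` is a hypothetical structure: one dict per step with a "tool" key.
    """
    violations = []

    # Autonomy limit: the agent must not run more steps than permitted.
    if len(trace) > max_steps:
        violations.append(f"step limit exceeded: {len(trace)} > {max_steps}")

    # Tool-use safety: every invoked tool must be on the allowlist.
    for index, step in enumerate(trace, start=1):
        tool = step.get("tool")
        if tool not in allowed_tools:
            violations.append(f"step {index}: tool '{tool}' is not permitted")

    return violations


# Illustrative trace: the third step calls a tool outside the allowlist.
example_trace = [{"tool": "search"}, {"tool": "calculator"}, {"tool": "send_email"}]
example_violations = audit_trace(example_trace, {"search", "calculator"}, max_steps=2)
```

Returning the full list of violations, rather than stopping at the first one, keeps each finding available for the audit trail.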

The scale and complexity of these risks, compared to those of a traditional predictive model, require the design of an improved model validation framework.

Framework for the validation of AI systems

The complexities of AI systems introduce behavioral risks that traditional validation frameworks are not designed to address. For this reason, AI validation frameworks require a set of additional complementary components:

Data quality and security: Ensures that the AI system receives secure, complete, and policy-compliant inputs before any validation begins. For generative and agentic systems, inputs include prompts, conversation history, retrieved texts, and system instructions.
Behavioral testing: Assesses whether generative AI systems and AI agents behave with proper discipline and control in practice. This includes how consistently the system reasons, how reliably it grounds results in available evidence, how it responds when information is missing or contradictory, and whether guardrails remain effective over time. For systems with agentic capabilities, behavioral testing also considers autonomy limits, routing decisions between components, and the safe use of tools.
Outcome evaluation: Reviews the quality of what the model produces: relevance, completeness, factual accuracy, clarity, tone, and the degree of human refinement required.
Human-in-the-Loop (HITL): Applied after outcome evaluation to incorporate human judgment for high-impact results where responsibility cannot be delegated to AI.
Continuous monitoring: Provides ongoing tracking of drift, hallucination patterns, retrieval failures, planning instability, and other behavioral changes over time.
Remediation mechanism: Even with thorough controls, generative AI systems and AI agents require continuous remediation due to their dynamic nature. Problems can arise at any stage, so remediation acts as a continuous feedback loop in which weaknesses trigger specific adjustments such as prompt refinement, model tuning, and guardrail updates, ensuring that the system remains stable, secure, and aligned with validation expectations.

All components of the framework apply to both generative AI systems and AI agents. When a system introduces autonomy or tool use, the behavioral testing component is applied more rigorously, with additional checks to address the risks these capabilities pose. The same assurance workflow is applied consistently throughout the lifecycle, with remediation triggered whenever validation findings, output issues, or monitoring signals indicate the need for corrective action.

01 Data quality and security

Data quality and security checks assess whether inputs are complete, well-formed, relevant to the intended task, and compliant with internal policies and usage restrictions. They ensure that inputs do not contain prohibited, unsafe, or inappropriate content, and do not solicit actions or access outside the permitted scope of the system.
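As an illustration, an input gate of this kind could be sketched as follows. The field names, blocked phrases and permitted actions are hypothetical placeholders chosen for the example; a real deployment would draw them from the institution's own policies.

```python
from dataclasses import dataclass, field

# Hypothetical policy values, for illustration only.
REQUIRED_FIELDS = ("prompt", "system_instructions")
BLOCKED_TERMS = ("ignore previous instructions", "disable guardrails")
ALLOWED_ACTIONS = {"summarize", "explain", "draft_narrative"}


@dataclass
class InputCheckResult:
    passed: bool
    issues: list[str] = field(default_factory=list)


def check_input(payload: dict) -> InputCheckResult:
    """Run completeness, policy and scope checks on one input payload."""
    issues = []

    # Completeness: every required field must be present and non-empty.
    for name in REQUIRED_FIELDS:
        if not str(payload.get(name, "")).strip():
            issues.append(f"missing or empty field: {name}")

    # Policy: flag prohibited content anywhere in the text inputs.
    text = " ".join(
        str(payload.get(key, "")) for key in ("prompt", "history", "retrieved_text")
    ).lower()
    for term in BLOCKED_TERMS:
        if term in text:
            issues.append(f"prohibited content: {term}")

    # Scope: the requested action must be one the system is permitted to perform.
    action = payload.get("requested_action")
    if action not in ALLOWED_ACTIONS:
        issues.append(f"action outside permitted scope: {action}")

    return InputCheckResult(passed=not issues, issues=issues)
```

Simple phrase matching is only a stand-in here; the point of the sketch is that each check records an explicit reason that can later feed remediation.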

02 Behavioral testing

Behavioral testing focuses on whether an AI system behaves safely, predictably, and consistently under different conditions, rather than assessing the quality of an individual result in isolation. This includes evaluating its reasoning, the reliability of its grounding in the available evidence, the consistency of its refusal behavior when information is missing or contradictory, and, when agentic capabilities are present, how the system plans, sequences actions, and uses tools to advance defined goals.

Behavioral testing is applied under a variety of controlled stress conditions, such as incomplete information, contradictory evidence, repeated executions, or adversarial pressure. These conditions do not define pass or fail results in and of themselves. Instead, they are used to identify behavioral weaknesses and to distinguish isolated output problems from systematic behavioral risks that may only arise under stress.
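One way to make the repeated-execution condition concrete is sketched below: the same prompt is run several times, and failures are labeled as isolated or systematic according to how often they occur. The 0.3 cut-off and the stub model are assumptions made for the example, not recommended values or a real model interface.

```python
from collections import Counter
from itertools import cycle
from typing import Callable


def repeated_run_stability(model_fn: Callable[[str], str], prompt: str, runs: int = 5) -> float:
    """Share of runs that agree with the most common answer (1.0 = fully stable)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    outputs = [model_fn(prompt).strip().lower() for _ in range(runs)]
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / runs


def classify_finding(failures: int, runs: int, systematic_share: float = 0.3) -> str:
    """Label a weakness as isolated or systematic from its failure rate.

    The default cut-off is an illustrative assumption.
    """
    if runs <= 0:
        raise ValueError("runs must be positive")
    if failures == 0:
        return "none"
    return "systematic" if failures / runs >= systematic_share else "isolated"


# Deterministic stand-in for a model call (hypothetical), so the sketch runs offline.
_answers = cycle(["approve", "approve", "decline", "approve", "approve"])


def stub_model(prompt: str) -> str:
    return next(_answers)


stability = repeated_run_stability(stub_model, "Assess the application", runs=5)
```

Exact string agreement is a deliberately crude notion of consistency; for free-text outputs a semantic comparison would be needed, which is outside the scope of this sketch.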

In more complex architectures, behavioral risk can arise not only within a single decision flow, but also from interactions between multiple agents. When multi-agent systems are used, behavioral testing extends to evaluating handoffs between agents, routing decisions, coordination between agents, and the stability of results across shared workflows.

Implementation decisions such as document chunking strategies, metadata design, and access controls are not treated as separate validation pillars. Their relevance arises through their behavioral impact. When these design decisions materially affect performance, they are explicitly evaluated through behavioral testing and outcome evaluation.

Scaling behavioral tests based on system complexity

Behavioral tests are applied proportionally. Their depth and breadth are adjusted to the autonomy and risk profile of the system:

Basic tests apply to all generative and agentic AI systems. These should confirm that the reasoning is logical and evidence-based, detect hallucinations and behavioral drift, assess stability over repeated executions, and verify that guardrails trigger safe refusals when inputs are incomplete, contradictory or out of scope.
RAG-dependent tests are applied when the system uses Retrieval-Augmented Generation (RAG). These assess the integrity of retrieval, ensuring that the correct sources are selected, properly cited, and used without invention, and that retrieval behavior remains stable across executions.
Agent-dependent tests evaluate whether the system selects and invokes tools appropriately and within allowable limits, follows the correct routing and escalation paths, avoids made-up or irrelevant steps, and maintains consistent workflows.
Enhanced tests are introduced for systems with higher risk or greater autonomy. These include adversarial stress testing, regulatory alignment checks, causal consistency checks, and confidentiality checks to ensure that sensitive information is not disclosed under pressure.
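The proportionality rule above can be expressed as a simple selection function. This is a sketch only: the profile fields and tier labels are hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemProfile:
    """Hypothetical description of a system under validation."""
    uses_rag: bool = False
    uses_tools: bool = False
    high_risk_or_autonomy: bool = False


def select_test_tiers(profile: SystemProfile) -> list[str]:
    """Return the behavioral test tiers that apply to a given system."""
    tiers = ["basic"]  # basic tests apply to every generative or agentic system
    if profile.uses_rag:
        tiers.append("rag_dependent")
    if profile.uses_tools:
        tiers.append("agent_dependent")
    if profile.high_risk_or_autonomy:
        tiers.append("enhanced")
    return tiers
```

For example, a retrieval-backed summarizer with no tool access would receive the basic and RAG-dependent tiers, while a high-autonomy agent with retrieval and tools would receive all four.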

03 Outcome evaluation

Outcome evaluation focuses on the quality, grounding, integrity, and professionalism of individual results.

For agentic AI systems, the evaluation also includes the safety and adequacy of the proposed actions or workflows. The level of human refinement required serves as a practical indicator of the reliability of the output.

To ensure that the AI-generated narrative is not only safe, but also analytically usable, each result must undergo a set of specific quality checks that assess its relevance, clarity, accuracy, and professional readiness:

Relevance assessment: Confirms that the narrative directly addresses the objective, question, or analytical requirement, detecting potential deviations.
Structural clarity and consistency check: Assesses whether the output is easy to follow, logically ordered, and free of ambiguity.
Factual accuracy review: Ensures that all claims are correct, verifiable, and evidence-based. Any unsubstantiated claim indicates a grounding failure.
Completeness check: Checks whether the narrative covers all required elements without omissions.
Tone and professionalism check: Confirms a neutral tone suitable for regulatory and senior management environments.
Editing effort score: Measures the level of human correction needed, identifying quality issues.

In practice, institutions perform outcome evaluation through a combination of automated routines and structured human review, with a clear distinction between mechanical checks and those that require judgment.

Mechanical checks are used when an objective comparison is possible. For example, checking whether factual claims are supported by retrieved evidence, checking for consistency with known baseline data, confirming that required sections are present, or detecting obvious out-of-scope content can be done automatically, consistently and at scale.
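Two of these mechanical checks lend themselves to a short sketch: confirming that required sections are present, and approximating an editing effort score by comparing the AI draft with the human-edited final text. The word-level similarity measure from Python's standard `difflib` module is one possible proxy chosen for illustration, not a prescribed metric, and the section names are hypothetical.

```python
import difflib


def missing_sections(text: str, required: list[str]) -> list[str]:
    """Return the required section titles that do not appear in the text."""
    lowered = text.lower()
    return [title for title in required if title.lower() not in lowered]


def editing_effort(draft: str, final: str) -> float:
    """Approximate editing effort: 0.0 = unchanged, 1.0 = fully rewritten.

    Uses word-level similarity between the AI draft and the human-edited
    version; this proxy is an assumption made for the example.
    """
    similarity = difflib.SequenceMatcher(None, draft.split(), final.split()).ratio()
    return 1.0 - similarity


# Illustrative usage with made-up text.
report = "Summary\nExposure increased over the quarter.\nConclusion\nNo action required."
gaps = missing_sections(report, ["Summary", "Evidence", "Conclusion"])
effort = editing_effort("exposure rose in the quarter", "exposure fell in the quarter")
```

A persistently high score on outputs of the same type would be a signal for remediation rather than a verdict on any single result.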

Human judgment is applied when the evaluation depends on context, subtlety, or intended use. This includes assessing whether the reasoning is sufficiently clear and persuasive, whether the narrative adequately addresses conflicting evidence, whether the tone and framing are appropriate for regulatory or senior management audiences, and whether the outcome is suitable for analysis or oversight.

04 Human-in-the-Loop (HITL)

HITL introduces explicit human judgment as a formal checkpoint before results are trusted, ensuring that the responsibility for high-impact decisions remains with experts and not the AI system.

HITL review is not applied by default. It is only activated when outputs are considered material or sensitive, when predefined risk thresholds are exceeded, or when ambiguity remains unresolved after automated checks. Typical examples include outcomes that influence relevant financial decisions, regulatory reporting, senior management decisions, or changes in key inputs.
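The activation conditions above can be captured in a small routing rule. The sketch below assumes a hypothetical assessment record produced by the earlier automated checks and an illustrative risk threshold of 0.7; both are placeholders.

```python
from dataclasses import dataclass


@dataclass
class OutputAssessment:
    """Hypothetical summary of the automated checks for one output."""
    material_or_sensitive: bool
    risk_score: float  # assumed to lie between 0.0 and 1.0
    unresolved_ambiguity: bool


def hitl_reasons(assessment: OutputAssessment, risk_threshold: float = 0.7) -> list[str]:
    """Return the reasons an output must go to human review (empty = no review)."""
    reasons = []
    if assessment.material_or_sensitive:
        reasons.append("material or sensitive impact")
    if assessment.risk_score > risk_threshold:
        reasons.append("risk threshold exceeded")
    if assessment.unresolved_ambiguity:
        reasons.append("ambiguity unresolved after automated checks")
    return reasons
```

Recording the reasons, rather than a bare yes or no, gives the reviewer context and leaves a trace of why the checkpoint was triggered.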

The goal of HITL is to maintain clear human accountability, avoid over-reliance on AI in material decisions, and provide a safeguard against residual reasoning errors before results are formally adopted, while monitoring and remediation activities continue throughout the broader AI lifecycle.

05 Continuous monitoring

Generative and agentic AI systems operate in dynamic environments where inputs, usage patterns, and context evolve over time. Continuous monitoring provides ongoing oversight to ensure that the behavior of the system remains within the limits set during validation. It complements formal validation by detecting behavioral drift under real operating conditions.

In practice, monitoring tracks a defined set of behavioral metrics, such as rates of unsubstantiated assertions, changes in reasoning patterns, retrieval stability, and refusal behavior under incomplete or contradictory inputs. These metrics are evaluated against predefined ranges and thresholds that reflect the institution's risk appetite, with clear distinctions between acceptable behavior, emerging concern, and unacceptable deviation.
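A three-band classification of this kind could be sketched as follows. The metric names and threshold values are invented for the example, and the sketch assumes that higher values are worse for every metric; actual ranges would be set according to the institution's risk appetite.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricBand:
    warn_at: float    # at or above this value: emerging concern
    breach_at: float  # at or above this value: unacceptable deviation


# Hypothetical metrics and thresholds, for illustration only.
BANDS = {
    "unsupported_claim_rate": MetricBand(warn_at=0.02, breach_at=0.05),
    "refusal_failure_rate": MetricBand(warn_at=0.01, breach_at=0.03),
}


def classify_metric(value: float, band: MetricBand) -> str:
    """Map one metric value to a monitoring status (higher values are worse)."""
    if value >= band.breach_at:
        return "unacceptable deviation"
    if value >= band.warn_at:
        return "emerging concern"
    return "acceptable"


def evaluate_snapshot(snapshot: dict[str, float]) -> dict[str, str]:
    """Classify every metric in one monitoring snapshot."""
    return {name: classify_metric(value, BANDS[name]) for name, value in snapshot.items()}


status = evaluate_snapshot({"unsupported_claim_rate": 0.03, "refusal_failure_rate": 0.005})
```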

Continuous monitoring aims to detect when behavior begins to move outside the established ranges under real operating conditions. Monitoring is carried out at defined intervals and after material changes in prompts, underlying models, retrieval settings, autonomy settings or execution context.

Continuous monitoring extends established model risk management practices to account for the dynamic and adaptive nature of modern AI systems. It provides confidence that validation conclusions remain reliable over time, while ensuring that behavioral changes are detected early and addressed before they have a material impact.

06 Remediation mechanisms

Even with robust controls, AI systems will require periodic corrections. Behavioral variability, retrieval dependencies, and autonomous execution processes mean that problems can arise at any point in the lifecycle. Therefore, the remediation mechanism operates as a continuous feedback loop, ensuring that every weakness identified, from data intake to post-deployment monitoring, leads to targeted and traceable adjustments.

In the input phase, failed data quality and security checks (e.g., unsafe content, incomplete inputs, inconsistent retrieval evidence) trigger remediation through updated prompt restrictions, rules, or enhanced retrieval settings, to ensure safe inputs aligned with internal policies before validation moves forward.
During model validation, behavioral findings are matched directly with corrective actions. Hallucination or drift signals require prompt refinement; stability issues may require adjustments to model choice; retrieval failures lead to improved retrieval scoring; and unsafe tool-use or autonomy behaviors are corrected through revised tool permissions, fallback paths, or adjustments to the step limit. These actions restore the predictable and auditable behavior of AI systems.
After outcome evaluation, remediation focuses on improving narrative quality. High editing effort, missing reasoning steps, or unclear structure are addressed by refining examples, instructions, and settings within prompts, ensuring that results meet analytical and supervisory expectations prior to HITL review.
Within HITL, repeated human corrections become explicit remediation signals. Recurring overrides and escalations indicate the need for updates to prompts, guardrails or autonomy limits, so that manually corrected problems do not reappear in future results.
In continuous monitoring, alerts for drift, retrieval instability or changes introduced by model updates automatically trigger remediation actions. These include meta-prompt updates, iterative regression testing, and realignment of refusal logic.

At all stages, remediation depends on a consistent set of control levers, including prompt tuning, model selection and tuning, retrieval improvements, guardrail updates, tool usage settings, and refinements to autonomy rules.

Each failed test or observed anomaly is mapped to one or more of these controls and addressed through specific corrective actions.
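This mapping from findings to control levers can be kept as an explicit lookup so that every anomaly has a traceable corrective route. The sketch below uses hypothetical finding labels; the lever assigned to each follows the pairings described in this section, and anything unmapped is surfaced for escalation rather than silently ignored.

```python
# Hypothetical finding labels mapped to the control levers named in this article.
CONTROL_LEVERS = {
    "hallucination": ["prompt tuning"],
    "drift": ["prompt tuning"],
    "instability": ["model selection and tuning"],
    "retrieval_failure": ["retrieval improvements"],
    "unsafe_tool_use": ["tool usage settings"],
    "autonomy_breach": ["autonomy rules"],
}


def plan_remediation(findings: list[str]) -> tuple[dict[str, list[str]], list[str]]:
    """Map findings to control levers; return (plan, findings with no known lever)."""
    plan = {}
    unmapped = []
    for finding in findings:
        if finding in CONTROL_LEVERS:
            plan[finding] = CONTROL_LEVERS[finding]
        else:
            unmapped.append(finding)  # needs manual triage
    return plan, unmapped


plan, unmapped = plan_remediation(["drift", "unsafe_tool_use", "tone_issue"])
```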

Remediation is explicitly based on risks and evidence, rather than on hypothetical scenarios or subjective judgment. Issues are not considered resolved by manual override or subjective approval alone. An issue is closed only once corrective actions have been implemented and revalidation confirms that the underlying behavioral risk is no longer reproduced within defined thresholds. This approach ensures that remediation strengthens the system in a lasting way, prevents recurrence under similar conditions, and maintains a clear audit trail of findings, actions, and results.
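The closure rule can be stated as a simple predicate. In this sketch, the issue record, its field names and the 0.01 threshold are illustrative assumptions; the logic reflects the principle that a manual override alone never closes an issue.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Issue:
    """Hypothetical remediation record for one finding."""
    finding: str
    actions_implemented: bool = False
    manually_overridden: bool = False
    revalidation_failure_rate: Optional[float] = None  # None = not yet revalidated


def can_close(issue: Issue, max_failure_rate: float = 0.01) -> bool:
    """An issue closes only with implemented actions and passing revalidation.

    `manually_overridden` is deliberately ignored: an override is recorded
    for audit purposes but does not count as resolution.
    """
    if not issue.actions_implemented:
        return False
    if issue.revalidation_failure_rate is None:
        return False
    return issue.revalidation_failure_rate <= max_failure_rate
```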

Next steps for institutions

As AI systems are integrated into decision-making and reporting, institutions must ensure that they behave predictably and produce verifiable results. Implementing a Model Risk Management framework with a well-calibrated AI validation approach will support the development of more robust, efficient, and reliable AI systems.

Companies that invest in good governance, accountability, and independent challenge will reap clear benefits: better performance, explainability, audit readiness, and fewer incidents. The integration of behavioral testing, HITL review, continuous monitoring, and robust remediation mechanisms will enable a secure and scalable adoption of AI to future-proof business models.

Authors

Daniel Fernández Domínguez
Risk & Analytics partner
Lukas Majer
Lukas is a director and head of quantitative risk in Spain. He has extensive experience in both traditional and novel risk modelling, covering credit risk, interest rate risk, market risk, climate risk, and AML risk. His expertise spans the full risk lifecycle across all three lines of defence. He has worked with both banks and regulators, providing him with a comprehensive perspective on supervisory expectations and industry best practice.
Dwayne Price
Dwayne joined Grant Thornton in 2016 as partner to head our Regulatory Advisory Services offering. Dwayne leads the firm’s risk and regulatory offering focusing on Quantitative Risk Advisory, Regulatory Risk and Strategy, and Regulatory Support services.
Jonathan Fitzpatrick
Jonathan is a partner in advisory and head of quantitative risk. Jonathan specialises in delivering risk measurement and quantification services to clients. He has over 16 years’ experience, working in both the UK and Ireland, where he led teams across professional services firms and in multinational banks.
Juan García
Senior manager
