Validation of artificial intelligence systems

Risk advisory
Validating AI systems becomes essential as generative and agentic AI models are integrated into critical processes. These technologies already influence decision-making, workflow automation, and the generation of analyses, which requires you to strengthen control, traceability, and compliance to manage your risk properly.

Introduction

Artificial intelligence is transforming the way organizations operate, reshaping processes, accelerating analysis, and enabling new ways of generating insights. As AI systems move from experimentation to core infrastructure, institutions must establish robust risk management frameworks that keep model behavior under control through validation, monitoring, and accountability. Existing model risk practices must evolve to address the behavioral and operational characteristics of both generative and agentic AI models, while maintaining trust and regulatory compliance to support sound decision-making.

Regulatory expectations are converging across EU and UK jurisdictions, requiring AI systems to meet standards of explainability, traceability, governance, and human oversight. In this article we propose a model risk framework designed to meet these expectations by integrating behavioral assurance, auditability, and control mechanisms, enabling institutions to scale AI models securely and with confidence.

The framework presented here follows a structured approach to comprehensive validation that encompasses data quality and security, behavioral testing, outcome evaluation, human-in-the-loop review, continuous monitoring, and remediation mechanisms.

 

The new generation of AI systems

Machine learning models have traditionally focused on statistical prediction, estimation of probabilities, classification of results, or detection of patterns in structured data. These models typically produce numerical results and are primarily evaluated using quantitative performance metrics. However, modern AI systems go beyond predictive modeling. They combine large language models (LLMs) with capabilities such as reasoning, information retrieval, planning, and interaction with external tools.

As a result, these systems behave more like analytical assistants than traditional statistical models. For the purposes of this article, we focus on these AI systems, which typically consist of two key components:

  • Generative AI: Systems used to produce reasoning and narrative explanations.
  • Agentic AI: Systems that can pursue goals, make decisions over several steps, use tools, and act autonomously.

Generative AI represents the foundational layer of modern AI capabilities. These models stand out for producing coherent, high-quality narratives: summarizing documents, reorganizing evidence, rewriting notes, and structuring complex behaviors into clear explanations.

Their strengths lie in interpretation and documentation, not in tasks such as performing calculations or inferring missing data. They improve human efficiency, but they are not a substitute for human judgment.

Agentic AI extends generative AI by introducing the ability to design structured plans, identify tools, and execute sequential tasks with limited human intervention. This introduces additional complexities from a model risk management perspective, including setting boundaries, governing tools, and checking the accuracy of results.

 

Thus, while generative AI and agentic AI are distinct types of systems, for validation purposes generative AI can be treated as the zero-autonomy baseline to which decision-making and agentic execution capabilities are added.

Several components of agent systems commonly appear in financial services workflows, each entailing distinct behaviors and risks that need to be validated. These include:

  • Planning agents that break down tasks into structured steps and sequence actions.
  • Retrieval agents that locate information in documents or databases.
  • Tool-using agents that interact with calculators, APIs, or internal systems to perform actions.
  • Orchestration agents that decide which tools or workflows to run.
  • Verification agents that review outputs and reasoning.

In combination, these agents can execute multi-step processes from start to finish, explicitly carrying context, intermediate outputs, and decisions from one step to the next, so that the overall workflow remains consistent and traceable.

 

The risks associated with AI

Modern AI systems introduce a series of risks that depend on how they generate results and, in some cases, how they are designed to act autonomously. To facilitate the assessment and control of these risks, it is useful to group them into categories that reflect where failures are most likely to arise. The risk types described below provide a practical way to structure validation and ensure that testing remains proportionate to how the system is built, how it behaves, and how it is used.

 

Design and Implementation Risk

Design and implementation risk describes situations where weaknesses in the way the system is built or configured manifest as unsafe or unintended behaviors at runtime. This includes improper autonomy configurations, faulty workflow design, inadequate tooling, flaws in architecture design, or improper configuration of prompts and guardrails, all of which can lead to results that fall outside intended behavior or do not comply with policies.

 

Core risk

Core risks are common to both generative AI models and agents and reflect fundamental failure modes inherent in data-driven probabilistic systems. These risks arise regardless of system autonomy or tool use and therefore constitute the basic risk layer applicable to all AI deployments.

 

The main risk categories typically include:

  • Factual integrity: Risk of unsubstantiated, unverifiable, false, or fabricated statements.
  • Reasoning integrity: Risk of causal gaps, faulty logic, missing steps, or incoherent reasoning.
  • Consistency: Risk of contradictory or internally inconsistent results between responses or executions.
  • Stability and drift: Risk that behavior changes across executions, model updates, or small input variations.
  • User overconfidence (governance): Risk that a fluent narrative leads users to trust AI results without adequate oversight.

 

Agent-specific risk

Agentic AI introduces additional risks due to its workflow-oriented and goal-oriented nature and its ability to act with greater autonomy. Unlike purely generative systems, these risks arise from the system's ability to plan, make intermediate decisions, invoke tools, and execute actions with limited human intervention.

As a result, agentic AI gives rise to additional risk categories, including:

  • Planning integrity: Risk of invented, irrelevant or unsafe steps in the plans generated.
  • Workflow consistency: Risk of incorrect sequencing, dependency errors, or faulty step logic.
  • Tool use security: Risk of unsafe or incorrect selection of tools or APIs, incorrect parameters, or misuse.
  • State integrity: Risk of corrupted or contaminated intermediate states across steps.
  • Retrieval integrity: Risk of erroneous source selection, incorrect grounding, or unstable retrieval behavior.
  • Auditability and traceability: Risk that plans, reasoning, or interactions with tools cannot be reproduced or traced.
  • Protection and autonomy: Risk that the agent exceeds the permitted autonomy, circumvents restrictions or performs unsafe actions.

The scale and complexity of these risks, compared to those of a traditional predictive model, require the design of an improved model validation framework.

 

Validation framework for AI systems

The complexities of AI systems introduce behavioral risks that traditional validation frameworks are not designed to address. For this reason, AI validation frameworks require a set of complementary components:

  1. Data quality and security: Ensures that the AI system receives secure, complete, and policy-compliant inputs before any validation begins. For generative and agentic systems, inputs include prompts, conversation history, retrieved texts, and system instructions.
  2. Behavioral testing: Assesses whether generative AI systems and AI agents behave with proper discipline and control in practice. This includes how consistently the system reasons, how reliably it grounds results in available evidence, how it responds when information is missing or contradictory, and whether guardrails remain effective over time. For systems with agentic capabilities, behavioral testing also considers autonomy limits, routing decisions between components, and the safe use of tools.
  3. Outcome evaluation: Reviews the quality of what the model produces: relevance, completeness, factual accuracy, clarity, tone, and the degree of human refinement required.
  4. Human-in-the-loop (HITL) review: Applied after outcome evaluation to incorporate human judgment for high-impact results where responsibility cannot be delegated to AI.
  5. Continuous monitoring: Provides ongoing tracking of drift, hallucination patterns, retrieval failures, planning instability, and other behavioral changes over time.
  6. Remediation mechanism: Even with thorough controls, generative AI systems and AI agents require continuous remediation due to their dynamic nature. Problems can arise at any stage, so remediation acts as a continuous feedback loop in which weaknesses trigger specific adjustments such as prompt refinement, model tuning, and guardrail updates, ensuring that the system remains stable, secure, and aligned with validation expectations.

All components of the framework apply to both generative AI systems and AI agents. When a system introduces autonomy or tool use, the behavioral testing component is more strictly enforced, with additional checks to address the risks these capabilities pose. The same assurance workflow is applied consistently throughout the lifecycle, with remediation being triggered whenever validation findings, output issues, or monitoring signals indicate the need for corrective actions.

Data quality and security checks assess whether inputs are complete, well-formed, relevant to the intended task, and compliant with internal policies and usage restrictions. They also ensure that inputs do not contain prohibited, unsafe, or inappropriate content, and do not solicit actions or access outside the permitted scope of the system.
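As a rough illustration of how such input checks might be expressed, the following Python sketch screens a single request before it reaches the model. The field names, the blocked-term list, and the length limit are hypothetical placeholders rather than part of the framework; a real deployment would draw them from internal policy.

```python
from dataclasses import dataclass, field

# Hypothetical policy values, chosen only for illustration.
BLOCKED_TERMS = {"password", "account number"}
MAX_PROMPT_CHARS = 4000


@dataclass
class InputCheckResult:
    passed: bool
    issues: list[str] = field(default_factory=list)


def check_input(prompt: str, retrieved_texts: list[str]) -> InputCheckResult:
    """Run basic completeness, form, and policy checks on one request."""
    issues = []
    if not prompt.strip():
        issues.append("empty prompt")
    if len(prompt) > MAX_PROMPT_CHARS:
        issues.append("prompt exceeds length limit")
    lowered = prompt.lower()
    for term in sorted(BLOCKED_TERMS):
        if term in lowered:
            issues.append(f"blocked term: {term}")
    # Retrieved passages count as inputs too, so empty ones are flagged.
    if any(not text.strip() for text in retrieved_texts):
        issues.append("empty retrieved text")
    return InputCheckResult(passed=not issues, issues=issues)
```

A request that fails any check would be held back and routed to remediation instead of being passed to the model.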

Behavioral testing focuses on whether an AI system behaves safely, predictably, and consistently under different conditions, rather than assessing the quality of an individual result in isolation. This includes evaluating its reasoning, the reliability of its grounding in the available evidence, the consistency of its refusal behavior when information is missing or contradictory, and, when agentic capabilities are present, how the system plans, sequences actions, and uses tools to advance defined goals.

Behavioral testing is applied under a variety of controlled stress conditions, such as incomplete information, contradictory evidence, repeated executions, or adversarial pressure. These conditions do not define pass or fail results in and of themselves. Instead, they are used to identify behavioral weaknesses and to distinguish isolated output problems from systematic behavioral risks that arise only under stress.
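One of these stress conditions, repeated execution, can be sketched in a few lines of Python. The function below is a minimal example that assumes the system under test can be called as a plain function from prompt to text. Exact matching after normalization is a deliberately crude proxy for agreement, and the resulting score is a signal for further review, not a pass or fail verdict.

```python
from collections import Counter
from typing import Callable


def repeat_agreement(system: Callable[[str], str], prompt: str, runs: int = 5) -> float:
    """Share of runs that return the most common (normalized) answer."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    # Normalize lightly so trivial whitespace or case changes do not count as drift.
    answers = [system(prompt).strip().lower() for _ in range(runs)]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / runs
```

In practice a semantic comparison would likely replace exact matching, since two differently worded answers can still agree in substance.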

In more complex architectures, behavioral risk can arise not only within a single decision flow, but also from interactions between multiple agents. When using multi-agent systems, behavioral testing extends to evaluating handoffs between agents, routing decisions, coordination between agents, and the stability of results across shared workflows.

Implementation decisions such as document chunking strategies, metadata design, and access controls are not treated as separate validation pillars. Their relevance arises through their behavioral impact. When these design decisions materially affect performance, they are explicitly evaluated through behavioral testing and outcome evaluation.

Scaling behavioral tests based on system complexity

Behavioral tests are applied proportionally. Their depth and breadth are adjusted to the autonomy and risk profile of the system:

  • Basic tests apply to all generative and agentic AI systems. These should confirm that the reasoning is logical and evidence-based, detect hallucinations and behavioral drift, assess stability over repeated executions, and verify that guardrails trigger safe refusals when inputs are incomplete, contradictory, or out of scope.
  • Retrieval-dependent tests are applied when using Retrieval-Augmented Generation (RAG). These assess the integrity of retrieval, ensuring that the correct sources are selected, properly cited, and used without invention, and that retrieval behavior remains stable across executions.
  • Agent-dependent tests evaluate whether the system selects and invokes tools appropriately and within allowable limits, follows the correct routing and escalation paths, detects made-up or irrelevant steps, and maintains consistent workflows.
  • Reinforcement tests are introduced for systems with higher risk or greater autonomy. These include adversarial stress testing, regulatory alignment checks, causal consistency checks, and confidentiality checks to ensure that sensitive information is not disclosed under pressure.
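The proportional scaling described above can be sketched as a simple selection rule. The profile fields and tier labels in this Python example are hypothetical names introduced for illustration; the point is only that the basic tier always applies and the other tiers are added as capabilities and risk increase.

```python
from dataclasses import dataclass


@dataclass
class SystemProfile:
    uses_retrieval: bool = False          # system relies on RAG
    uses_tools: bool = False              # system invokes tools or acts as an agent
    high_risk_or_autonomy: bool = False   # higher risk or autonomy capacity


def select_test_tiers(profile: SystemProfile) -> list[str]:
    """Return the test tiers that apply to a system, from basic upward."""
    tiers = ["basic"]  # applies to every generative or agentic system
    if profile.uses_retrieval:
        tiers.append("retrieval")
    if profile.uses_tools:
        tiers.append("agent")
    if profile.high_risk_or_autonomy:
        tiers.append("reinforcement")
    return tiers
```

For example, a retrieval-backed summarization assistant with no tool access would receive the basic and retrieval tiers only.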

Outcome evaluation focuses on the quality, grounding, integrity, and professionalism of individual results.

For agentic AI systems, the evaluation also includes the safety and adequacy of the proposed actions or workflows. The level of human refinement required serves as a practical indicator of the reliability of the output.

To ensure that the AI-generated narrative is not only safe, but also analytically usable, each result must undergo a set of specific quality checks that assess its relevance, clarity, accuracy, and professional readiness:

  • Relevance assessment: Confirms that the narrative directly addresses the objective, question, or analytical requirement, detecting potential deviations.
  • Structural clarity and consistency check: Assesses whether the output is easy to follow, logically ordered, and free of ambiguity.
  • Factual accuracy review: Ensures that all claims are correct, verifiable, and evidence-based. Any unsubstantiated claims indicate a grounding failure.
  • Completeness scan: Checks whether the narrative covers all required elements without omissions.
  • Tone and professionalism check: Confirms a neutral tone suitable for regulatory and senior management environments.
  • Editing effort score: Measures the level of human correction needed, which helps identify quality issues.
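The article does not define how the editing effort score is computed. One plausible approach, sketched here under that assumption, compares the AI draft with the human-approved final text at word level using Python's standard `difflib` module. The function name and the 0-to-1 scale are illustrative choices.

```python
from difflib import SequenceMatcher


def editing_effort(draft: str, final: str) -> float:
    """Return 0.0 when the reviewer changed nothing and 1.0 for a full rewrite."""
    if not draft and not final:
        return 0.0
    # Compare word sequences; autojunk is disabled so that frequent words
    # in long texts are not silently ignored by the matcher.
    matcher = SequenceMatcher(None, draft.split(), final.split(), autojunk=False)
    return round(1.0 - matcher.ratio(), 3)
```

Tracked over time, a rising score for a given use case would suggest that output quality is slipping even if each individual result still passes review.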

 

In practice, institutions perform outcome evaluation through a combination of automated routines and structured human review, with a clear distinction between mechanical checks and those that require judgment.

Mechanical checks are used when an objective comparison is possible. For example, verifying that factual claims are supported by retrieved evidence, checking consistency with known baseline data, confirming that required sections are present, and detecting obvious out-of-scope content can all be done automatically, consistently, and at scale.
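Two such mechanical checks are sketched below in Python, assuming the output is plain text with one heading per line. The function names are hypothetical, and matching numeric tokens against the evidence is only a narrow stand-in for full claim verification.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def missing_sections(report: str, required: list[str]) -> list[str]:
    """Return required section headings that do not appear as a line of the report."""
    lines = {line.strip().lower() for line in report.splitlines()}
    return [name for name in required if name.lower() not in lines]


def unsupported_figures(report: str, evidence: str) -> list[str]:
    """Return numeric tokens in the report that never occur in the evidence."""
    known = set(NUMBER.findall(evidence))
    flagged = []
    for token in NUMBER.findall(report):
        if token not in known and token not in flagged:
            flagged.append(token)
    return flagged
```

Anything these checks flag can be sent back automatically, leaving human reviewers to concentrate on the judgment-based aspects.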

Human judgment is applied when the evaluation depends on context, subtlety, or intended use. This includes assessing whether the reasoning is sufficiently clear and persuasive, whether the narrative adequately addresses conflicting evidence, whether the tone and framing are appropriate for regulatory or senior management audiences, and whether the outcome is suitable for analysis or oversight.

HITL introduces explicit human judgment as a formal checkpoint before results are trusted, ensuring that the responsibility for high-impact decisions remains with experts and not the AI system.

HITL review is not applied by default. It is activated only when outputs are considered material or sensitive, when predefined risk thresholds are exceeded, or when ambiguity remains unresolved after automated checks. Typical examples include outcomes that influence relevant financial decisions, regulatory reporting, senior management decisions, or changes in key inputs.
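The three activation conditions can be read as a simple gating rule. The Python sketch below uses hypothetical field names and an arbitrary default threshold of 0.7, which is not taken from the article; an institution would set this value from its own risk appetite.

```python
from dataclasses import dataclass


@dataclass
class OutputAssessment:
    material_or_sensitive: bool   # e.g. the output feeds regulatory reporting
    risk_score: float             # hypothetical score from automated checks, 0 to 1
    unresolved_ambiguity: bool    # automated checks could not settle an issue


def needs_human_review(assessment: OutputAssessment, risk_threshold: float = 0.7) -> bool:
    """Route to HITL review when any one of the three conditions holds."""
    return (
        assessment.material_or_sensitive
        or assessment.risk_score > risk_threshold
        or assessment.unresolved_ambiguity
    )
```

Because the conditions are combined with a logical OR, a low-risk score never suppresses review of an output that is material or still ambiguous.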

The goal of HITL is to maintain clear human accountability, avoid over-reliance on AI in material decisions, and provide a safeguard against residual reasoning errors before results are formally adopted, allowing monitoring and remediation activities to continue throughout the broader AI lifecycle.

Generative and agentic AI systems operate in dynamic environments where inputs, usage patterns, and context evolve over time. Continuous monitoring provides ongoing oversight to ensure that the behavior of the system remains within the limits set during validation. It acts as a complement to formal validation by detecting behavioral drift under real operating conditions.

In practice, monitoring tracks a defined set of behavioral metrics, such as rates of unsubstantiated assertions, changes in reasoning patterns, retrieval stability, and refusal behavior under incomplete or contradictory inputs. These metrics are evaluated against predefined ranges and thresholds that reflect the institution's risk appetite, with clear distinctions between acceptable behavior, emerging concern, and unacceptable deviation.
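A minimal Python sketch of this three-band evaluation follows. It assumes a metric where higher values are worse, such as the rate of unsubstantiated assertions; the function and parameter names are hypothetical, and the thresholds would come from the institution's own risk appetite.

```python
def classify_metric(value: float, warn_at: float, breach_at: float) -> str:
    """Band a monitored metric for which higher values indicate worse behavior."""
    if warn_at > breach_at:
        raise ValueError("warn_at must not exceed breach_at")
    if value >= breach_at:
        return "unacceptable deviation"
    if value >= warn_at:
        return "emerging concern"
    return "acceptable"
```

With a warning level of 2% and a breach level of 5%, an observed rate of 3% would be reported as an emerging concern rather than a breach.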

Continuous monitoring aims to detect when behavior begins to move outside the established ranges under real operating conditions. Monitoring is carried out at defined intervals and after material changes in prompts, underlying models, retrieval settings, autonomy settings, or execution context.

Continuous monitoring extends established model risk management practices to account for the dynamic and adaptive nature of modern AI systems. It provides confidence that validation conclusions remain reliable over time, while ensuring that behavioral changes are detected early and addressed before they have a material impact.

Even with robust controls, AI systems will require periodic corrections. Behavioral variability, retrieval dependencies, and autonomous execution processes mean that problems can arise at any point in the lifecycle. Therefore, the remediation mechanism operates as a continuous feedback loop, ensuring that every weakness identified, from data intake to post-deployment monitoring, leads to targeted and traceable adjustments.

  • In the input phase, failed data quality and security checks (e.g., unsafe content, incomplete inputs, inconsistent retrieval evidence) trigger remediation through updated prompt restrictions, rules, or enhanced retrieval settings, to ensure that inputs are safe and aligned with internal policies before validation moves forward.
  • During model validation, behavioral findings are directly matched with corrective actions. Hallucination or drift signals require prompt refinement; stability issues may require adjustments to model choice; retrieval failures lead to improved scoring; and unsafe tool use or autonomy behaviors are corrected through revised tool permissions, alternate paths, or adjustments to the tool step limit. These actions restore the predictable and auditable behavior of AI systems.
  • After outcome evaluation, remediation focuses on improving narrative quality. High editing effort, missing reasoning steps, or unclear structure are addressed by refining examples, instructions, and settings within prompts, ensuring that results meet analytical and supervisory expectations prior to HITL review.
  • Within HITL, repeated human corrections become explicit remediation signals. Recurring overrides and escalations indicate the need to update prompts, guardrails, or autonomy limits, so that manually corrected problems do not reappear in future results.
  • In continuous monitoring, alerts for drift, retrieval instability, or changes introduced by model updates automatically trigger remediation actions. These include meta-prompt updates, iterative regression testing, and realignment of refusal logic.

At all stages, remediation depends on a consistent set of control levers, including prompt tuning, model selection and tuning, retrieval improvements, safety updates, tool usage settings, and refinements to autonomy rules.

Each failed test or observed anomaly is mapped to one or more of these controls and addressed through specific corrective actions.

Remediation is explicitly based on risks and evidence, rather than on scenarios or subjective judgment. Issues are not considered resolved by manual override or subjective approval alone. A finding is closed only once corrective actions have been implemented and revalidation confirms that the underlying behavioral risk is no longer reproduced within defined thresholds. This approach ensures that remediation strengthens the system in a lasting way, prevents recurrence under similar conditions, and maintains a clear audit trail (findings, actions, and results).
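The closure rule can be made concrete with a small record type. The Python sketch below is illustrative only: the field names are hypothetical, and the revalidation result is assumed to be a failure rate compared against a defined threshold.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Finding:
    description: str
    control_levers: list[str] = field(default_factory=list)  # e.g. "prompt tuning"
    actions_implemented: bool = False
    revalidation_failure_rate: Optional[float] = None  # None until revalidation has run
    manual_override: bool = False  # recorded for the audit trail, never closes a finding


def can_close(finding: Finding, threshold: float) -> bool:
    """Close only when actions are in place and revalidation is within threshold."""
    if not finding.actions_implemented:
        return False
    if finding.revalidation_failure_rate is None:
        return False
    return finding.revalidation_failure_rate <= threshold
```

Note that `manual_override` is deliberately ignored by `can_close`, which mirrors the principle that approval alone does not resolve an issue.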

 

Next steps for institutions

As AI systems are integrated into decision-making and reporting, institutions must ensure that they behave predictably and produce verifiable results. Implementing a Model Risk Management framework with a well-calibrated AI validation approach will support the development of more robust, efficient, and reliable AI systems.

Companies that invest in good governance, accountability, and independent challenge will reap clear benefits: better performance, explainability, audit readiness, and fewer incidents. The integration of behavioral testing, HITL review, continuous monitoring, and robust remediation mechanisms will enable secure and scalable adoption of AI to future-proof business models.