How to Validate AI Batch Monitoring Under GAMP 5
February 2026 · 12 min read
A practical framework for pharmaceutical manufacturers deploying AI-powered anomaly detection in GMP-regulated batch production, covering Category 5 classification, IQ/OQ/PQ, model versioning, Annex 22 alignment, and human-on-the-loop governance.
Author: Vilmer Frost, BatchCortex
Published: February 2026
Version: 1.0
Contact: vilmer@batchcortex.com
Abstract
The deployment of machine learning models in pharmaceutical batch manufacturing is no longer a theoretical future state; it is a present regulatory challenge. EU GMP Annex 22, the GAMP 5 AI Guide (2025), and the FDA's January 2025 draft guidance on AI credibility assessments have collectively created a concrete compliance framework for AI in GMP environments. This whitepaper provides a structured, practitioner-focused guide to validating AI batch monitoring systems under GAMP 5, with specific focus on anomaly detection models used in real-time process monitoring. Each section maps directly to the documentation and validation activities required to deploy such a system in a regulated EU or US pharmaceutical manufacturing environment.
1. Why AI Systems Fall Into GAMP 5 Category 5
GAMP 5, Good Automated Manufacturing Practice, Second Edition (2022), classifies computerised systems into four software categories based on complexity and novelty. Understanding where an AI batch monitoring system falls within this taxonomy is not an academic exercise. The category determines the depth of validation documentation required, the testing protocols expected, and the regulatory scrutiny that follows.
1.1 The Four GAMP Software Categories
Category 1: Infrastructure software
Examples: Operating systems, databases, middleware
Validation burden: Low, vendor verification only
Category 3: Non-configured products
Examples: Commercial off-the-shelf tools used as supplied, including standard office software and spreadsheets
Validation burden: Low to medium
Category 4: Configured products
Examples: Standard LIMS, ERP, MES with configuration
Validation burden: Medium, configuration testing
Category 5: Custom or bespoke software
Examples: Custom AI models, novel algorithms, ML pipelines
Validation burden: High, full IQ/OQ/PQ lifecycle
Note: GAMP 5 retired Category 2 (firmware) as a separate category, so the active categories are 1, 3, 4, and 5.
An AI anomaly detection system such as an Isolation Forest model, LSTM autoencoder, or any trained ML model used in real-time batch monitoring falls unambiguously into Category 5. The GAMP AI Guide (July 2025) is explicit on this point.
1.2 Three Reasons AI Is Always Category 5
Reason 1: Non-deterministic behaviour
Traditional validated software produces the same output for the same input, every time. Machine learning models are statistically trained systems. Their outputs depend on training data, hyperparameters, random seeds, and the specific model version in deployment. This requires a validation approach designed for probabilistic systems.
Reason 2: Custom training data creates unique system behaviour
A Category 4 system behaves according to vendor specifications. An AI anomaly detection model trained on your batch data is a bespoke system. Its behaviour is unique to its training data, and your validation team owns it.
Reason 3: The GAMP AI Guide (2025) says so explicitly
AI and ML systems used in GMP-impacting decisions require documented training data provenance, model architecture documentation, performance metrics on representative test sets, change control for every model update, and continuous drift monitoring post-deployment.
Key regulatory position: The FDA's January 2025 draft guidance on AI credibility assessments and EMA/FDA joint principles (January 14, 2026) align with GAMP Category 5 classification for AI systems with GMP critical decision-making roles.
1.3 What Category 5 Classification Means in Practice
- A full validation lifecycle must be documented before production deployment.
- All validation activities must be approved and signed by appropriate personnel.
- Every model update, retraining, hyperparameter update, or architecture change requires formal change control.
- The system must be continuously monitored for performance drift.
- Model retirement must be formally documented.
For AI batch monitoring, the validation package must cover both the software infrastructure and the trained model artifact itself.
2. URS Considerations for AI Anomaly Detection
The User Requirements Specification is the foundation of the GAMP 5 lifecycle. For AI anomaly detection, requirements must capture probabilistic performance and governance constraints instead of only deterministic pass or fail behaviour.
2.1 What Makes an AI URS Different
AI system requirements cannot be fully expressed as deterministic outcomes. The URS should define acceptable performance ranges and qualification criteria using metrics like sensitivity, specificity, and detection latency.
Critical distinction: A traditional requirement states exact deterministic behaviour. An AI requirement defines minimum validated performance on a representative qualification dataset.
2.2 Essential URS Requirements for AI Batch Monitoring
Process scope and boundary conditions
- Defined process types, sensor parameters, and unit ranges.
- Included and excluded batch types in scope.
- Normal operating ranges and proven acceptable ranges by parameter.
- Alert limits and action limits separated from system limits.
Performance requirements (probabilistic)
- Minimum sensitivity and minimum specificity targets.
- Maximum detection latency after anomaly onset.
- Drift detection sensitivity requirements.
- Minimum confidence threshold for alert generation.
Explainability requirements
- Every alert includes a clear rationale and triggering sensors.
- Root cause hypothesis with confidence and evidence.
- No unexplainable recommendation enters QA review workflow.
- AI recommendations remain understandable to qualified personnel.
Data integrity requirements
- Server-side timestamps for all sensor inputs.
- Raw values preserved before transformation.
- Model version logged with each detection event.
- Append-only immutable audit trail controls.
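The audit trail requirements above can be illustrated with a hash-chained, append-only event log: each entry embeds the hash of the previous entry, so any retroactive edit breaks verification. This is a minimal sketch, not a prescribed design; class and field names are illustrative, and a production system would also need durable storage and Part 11 signature controls.

```python
import hashlib
import json
from datetime import datetime, timezone

class AuditTrail:
    """Append-only event log; each entry chains the previous entry's hash,
    so any retroactive modification is detectable (tamper-evident)."""

    def __init__(self):
        self._entries = []

    def append(self, event: dict) -> str:
        prev_hash = self._entries[-1]["hash"] if self._entries else "0" * 64
        record = {
            "timestamp": datetime.now(timezone.utc).isoformat(),  # server-side timestamp
            "event": event,
            "prev_hash": prev_hash,
        }
        payload = json.dumps(record, sort_keys=True).encode()
        record["hash"] = hashlib.sha256(payload).hexdigest()
        self._entries.append(record)
        return record["hash"]

    def verify(self) -> bool:
        """Recompute the chain; returns False if any entry was altered."""
        prev = "0" * 64
        for rec in self._entries:
            if rec["prev_hash"] != prev:
                return False
            payload = json.dumps(
                {k: rec[k] for k in ("timestamp", "event", "prev_hash")},
                sort_keys=True,
            ).encode()
            if hashlib.sha256(payload).hexdigest() != rec["hash"]:
                return False
            prev = rec["hash"]
        return True
```

The chaining also gives the OQ team a concrete tamper test: modify any stored entry and `verify()` must fail.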
Human oversight requirements
- No autonomous production action without human approval under Annex 22.
- QA sign-off required before final deviation completion.
- QP release authority cannot be delegated to AI.
- Qualified personnel can dismiss or escalate every alert.
2.3 URS Traceability
Every URS requirement should map forward to FRS requirements and OQ test scripts in a traceability matrix.
- URS-001: Real-time monitoring for temperature, pressure, RPM (FRS-004, OQ-T-001, Planned)
- URS-002: Sensitivity greater than or equal to 85% (FRS-012, OQ-T-008, Planned)
- URS-003: Model version logged per event (FRS-017, OQ-T-015, Planned)
- URS-004: Human approval required before production action (FRS-021, OQ-T-019, Planned)
- URS-005: Append-only, tamper-evident audit trail (FRS-025, OQ-T-023, Planned)
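The matrix above can be held as simple structured data and checked mechanically for forward-trace completeness. This is a sketch using the IDs listed above; the `untraced` helper is a hypothetical name, and a real traceability tool would also trace backwards from test scripts to requirements.

```python
# Minimal traceability-matrix sketch: each URS item must trace forward
# to at least one FRS requirement and one OQ test script.
traceability = {
    "URS-001": {"frs": ["FRS-004"], "oq": ["OQ-T-001"], "status": "Planned"},
    "URS-002": {"frs": ["FRS-012"], "oq": ["OQ-T-008"], "status": "Planned"},
    "URS-003": {"frs": ["FRS-017"], "oq": ["OQ-T-015"], "status": "Planned"},
    "URS-004": {"frs": ["FRS-021"], "oq": ["OQ-T-019"], "status": "Planned"},
    "URS-005": {"frs": ["FRS-025"], "oq": ["OQ-T-023"], "status": "Planned"},
}

def untraced(matrix: dict) -> list:
    """Return URS IDs missing a forward trace to FRS or OQ."""
    return [urs for urs, t in matrix.items() if not (t["frs"] and t["oq"])]

# An empty result means every requirement has forward coverage.
```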
3. FRS Implications for Explainability
For AI systems, the FRS should specify the AI-to-human interface outputs rather than undocumented internal model mechanics.
3.1 The Explainability Imperative
Annex 22 requires outputs that are understandable to qualified personnel. Explainability is a compliance requirement, not a UX enhancement.
Regulatory principle: Explainable output protects both manufacturer and supplier in inspection and product liability contexts.
3.2 FRS Requirements for Explainability
Alert output specification
- Parameter, threshold, and timestamp on every alert.
- Clear distinction between hard limit breach, drift, and multivariate anomaly.
- Confidence score display per alert.
- Detection method and model version identifier included.
Root cause output specification
- Primary hypothesis in plain language.
- Supporting sensor evidence and baseline deltas.
- Hypothesis confidence score.
- Alternative hypotheses when confidence is low.
- Recommended action clearly marked as recommendation only.
3.3 Explainability Architecture Options
- Isolation Forest: Feature scores and ranking, low regulatory risk, best for real-time multivariate detection.
- LSTM Autoencoder: Reconstruction error and attention weights, medium regulatory risk, best for temporal sequence anomalies.
- Random Forest: Feature importance and decision tracing, low regulatory risk, best for classification tasks.
- Deep Neural Network: SHAP or LIME approximation, medium to high regulatory risk, best for complex pattern recognition.
- Pure LLM: Prompt and response visibility only, high regulatory risk, best limited to report generation.
For many GMP monitoring use cases, Isolation Forest provides a practical explainability profile that aligns with validation needs.
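To make the "feature scores and ranking" output concrete, the sketch below ranks sensors by deviation from a qualified baseline. To be clear, this is a simplified z-score stand-in, not the Isolation Forest algorithm itself (which scores anomalies via random partition path lengths); it only illustrates the shape of the per-sensor attribution that an alert would carry. All names and the baseline figures are illustrative.

```python
from statistics import mean  # stdlib; baseline stats would come from qualified batches

def rank_feature_deviations(reading: dict, baseline: dict) -> list:
    """Rank sensors by |z-score| against qualified baseline (mean, stddev).
    A simplified stand-in for model feature attribution: the top-ranked
    sensors become the 'triggering sensors' listed on the alert."""
    scores = []
    for sensor, value in reading.items():
        mu, sigma = baseline[sensor]
        z = (value - mu) / sigma if sigma else 0.0
        scores.append((sensor, round(abs(z), 2)))
    return sorted(scores, key=lambda s: s[1], reverse=True)

# Baseline (mean, stddev) per sensor, e.g. derived from qualified batches.
baseline = {"temp_C": (37.0, 0.5), "pressure_bar": (1.2, 0.05), "rpm": (120, 3)}
reading = {"temp_C": 39.1, "pressure_bar": 1.21, "rpm": 121}
ranked = rank_feature_deviations(reading, baseline)
# temp_C deviates by |z| = 4.2 and ranks first as the triggering sensor
```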
4. IQ/OQ/PQ for ML Models
4.1 Installation Qualification (IQ)
IQ confirms correct installation of infrastructure and model artifacts.
Infrastructure IQ checkpoints
- OS and environment versions verified and documented.
- Library versions documented and controlled.
- Database schema version aligned to design specification.
- Network segmentation and read-only connectivity verified.
- User roles and access matrix validated.
Model IQ checkpoints
- Deployed model file matches recorded SHA256 hash.
- Version identifier and metadata completeness verified.
- Startup load path verified for expected model artifact.
- Known-input inference sanity check passes.
IQ key principle: Hash verification proves the deployed model is exactly the validated model.
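The hash check can be enforced at startup so the system refuses to run with an unverified artifact. A minimal sketch, assuming the expected SHA256 comes from the model version register; the function name and error wording are illustrative.

```python
import hashlib
from pathlib import Path

def verify_model_artifact(path: str, expected_sha256: str) -> None:
    """IQ check: refuse to load a model whose SHA256 does not match the
    hash recorded in the validated model version register."""
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    if digest != expected_sha256:
        raise RuntimeError(
            f"Model artifact hash mismatch: deployed {digest[:12]}..., "
            f"validated {expected_sha256[:12]}...; aborting startup."
        )
```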
4.2 Operational Qualification (OQ)
OQ demonstrates that the system operates as specified across the defined operating range.
Qualification datasets must be representative, independent from training data, labelled by qualified SMEs, and governed by a data qualification protocol.
- OQ-T-001: Hard limit temperature breach. Expected alert within one reading. Pass: 100% detection with latency under 2 readings.
- OQ-T-002: Multivariate anomaly detection. Expected alert on known anomalous combination. Pass: sensitivity greater than or equal to 85%.
- OQ-T-003: False positive rate on normal batches. Expected no alerts on normal data. Pass: specificity greater than or equal to 90%.
- OQ-T-004: Drift detection. Expected alert before hard limit. Pass: at least 5 readings before breach.
- OQ-T-005: Confidence score accuracy. Expected high confidence for clear positives. Pass: confidence greater than 80%.
- OQ-T-006: Model version logging. Expected version in every event. Pass: 100% record completeness.
- OQ-T-007: Report generation completeness. Expected all required fields. Pass: 100% field completeness.
- OQ-T-008: Part 11 signature controls. Expected re-authentication and identity capture. Pass: all controls present.
- OQ-T-009: Audit trail immutability. Expected modification blocked. Pass: tampering attempt fails.
- OQ-T-010: Data gap handling. Expected graceful alert or degradation event. Pass: no silent failure.
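The sensitivity and specificity pass criteria from OQ-T-002 and OQ-T-003 can be computed directly from SME-labelled qualification results. A sketch under the assumption that each result pairs the SME label with the model's alert decision; function names are illustrative.

```python
def oq_metrics(results: list) -> dict:
    """Compute sensitivity and specificity from labelled OQ outcomes.
    Each result pairs the SME label ('anomaly') with the model decision ('alerted')."""
    tp = sum(1 for r in results if r["anomaly"] and r["alerted"])
    fn = sum(1 for r in results if r["anomaly"] and not r["alerted"])
    tn = sum(1 for r in results if not r["anomaly"] and not r["alerted"])
    fp = sum(1 for r in results if not r["anomaly"] and r["alerted"])
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else None,
        "specificity": tn / (tn + fp) if (tn + fp) else None,
    }

def oq_passes(metrics: dict) -> bool:
    """Apply the acceptance criteria: sensitivity >= 85%, specificity >= 90%."""
    return metrics["sensitivity"] >= 0.85 and metrics["specificity"] >= 0.90
```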
4.3 Performance Qualification (PQ)
PQ confirms stable performance in live production conditions over time.
- Run AI monitoring in parallel with current process controls.
- Use a minimum three-month or fifty-batch qualification window.
- Compare monthly PQ metrics to OQ baseline.
- Trigger investigation when degradation exceeds 5%.
- Require QA Director and QP sign-off for PQ completion.
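The 5% degradation trigger can be expressed as a simple monthly comparison against the OQ baseline. A sketch, assuming relative degradation is the agreed measure; the function name and metric keys are illustrative.

```python
def pq_degradation_check(oq_baseline: dict, monthly: dict, limit: float = 0.05) -> list:
    """Compare monthly PQ metrics to the OQ baseline; return the metrics
    whose relative degradation exceeds the investigation trigger (5%)."""
    flagged = []
    for metric, base in oq_baseline.items():
        drop = (base - monthly[metric]) / base
        if drop > limit:
            flagged.append(metric)
    return flagged

# Example: sensitivity falling from 0.90 to 0.83 is a 7.8% relative drop
# and would trigger an investigation; specificity 0.92 -> 0.91 would not.
```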
5. Model Versioning and Change Control
In GMP AI systems, retraining alone is a controlled change. A model update without change control is equivalent to deploying an unvalidated system.
5.1 What Constitutes a Model Change
- Retraining with new data, same architecture: change control is required; scope is OQ performance re-verification.
- Hyperparameter changes: change control is required; scope is full OQ rerun.
- Algorithm change: change control is required; scope is full IQ/OQ/PQ lifecycle.
- Threshold sensitivity change: change control is required; scope is affected OQ rerun.
- Inference pipeline bug fix: minor change control required; scope is targeted OQ regression.
- Infrastructure runtime update: minor change control required; scope is IQ check and OQ regression.
- Routine operation with no model or code change: no change control required; scope is continuous monitoring only.
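The scope rules above can be encoded as a fail-closed lookup so that an unrecognised change type cannot silently bypass revalidation. The keys and helper name below are illustrative identifiers, not terms from GAMP 5 itself.

```python
# Change-type to revalidation-scope lookup mirroring the rules above.
REVALIDATION_SCOPE = {
    "retrain_same_architecture": "OQ performance re-verification",
    "hyperparameter_change": "full OQ rerun",
    "algorithm_change": "full IQ/OQ/PQ lifecycle",
    "threshold_change": "affected OQ rerun",
    "inference_pipeline_bugfix": "targeted OQ regression",
    "infrastructure_runtime_update": "IQ check and OQ regression",
}

def required_scope(change_type: str) -> str:
    """Fail closed: an unknown change type demands the full lifecycle
    rather than silently skipping change control."""
    return REVALIDATION_SCOPE.get(change_type, "full IQ/OQ/PQ lifecycle")
```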
5.2 The Model Version Control System
Recommended version convention: [Algorithm]-v[Major].[Minor].[Patch], for example IF-v2.1.4.
- Major: algorithm or architecture change.
- Minor: meaningful retraining or hyperparameter revision.
- Patch: bug fix or minor controlled update.
Each deployed model must include metadata for version, training dataset, hyperparameters, qualification metrics, hash, change control reference, approval signatures, and deployment details.
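The version convention and metadata record can be sketched as follows. Field names are illustrative, and the regex simply enforces the [Algorithm]-v[Major].[Minor].[Patch] convention described above; a real register would persist these records under change control.

```python
import re
from dataclasses import dataclass

# Controlled convention: [Algorithm]-v[Major].[Minor].[Patch], e.g. IF-v2.1.4
VERSION_RE = re.compile(r"^(?P<algo>[A-Za-z]+)-v(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)$")

def parse_version(version_id: str) -> tuple:
    """Split 'IF-v2.1.4' into ('IF', 2, 1, 4); reject non-conforming ids."""
    m = VERSION_RE.match(version_id)
    if not m:
        raise ValueError(f"not in [Algorithm]-v[Major].[Minor].[Patch] form: {version_id}")
    return (m["algo"], int(m["major"]), int(m["minor"]), int(m["patch"]))

@dataclass(frozen=True)
class ModelVersionRecord:
    """One row of the model version register; field names are illustrative."""
    version_id: str            # e.g. "IF-v2.1.4"
    training_dataset_id: str   # provenance reference for the training data
    hyperparameters: dict
    qualification_metrics: dict  # e.g. {"sensitivity": 0.91, "specificity": 0.94}
    sha256: str                # verified against the artifact at IQ / startup
    change_control_ref: str    # e.g. "CC-2026-0041"
    approved_by: tuple         # approval signature identities
    deployed_at: str           # deployment timestamp

    def __post_init__(self):
        parse_version(self.version_id)  # enforce the convention on creation
```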
5.3 Model Retirement
- Retired models are archived, not deleted.
- Archives are searchable by version and date range.
- Every batch record maintains model-version traceability.
- Retention follows GMP batch record policy requirements.
5.4 Predetermined Change Control Plans (ICH Q12 Alignment)
PCCPs can predefine low-risk change boundaries to reduce operational burden while preserving validation control and QA accountability.
6. Annex 22 Alignment
Annex 22 defines practical expectations for transparency, control, and accountability in GMP AI deployments.
6.1 The Six Pillars of Annex 22 Compliance
Pillar 1: Data governance
- Documented source, date range, and preprocessing history.
- Representative data across intended operating conditions.
- ALCOA+ integrity controls aligned with GMP expectations.
Pillar 2: Model development and validation
- Documented algorithm selection rationale.
- Defined and justified train-validation-test splits.
- Predefined metrics and acceptance criteria.
Pillar 3: Deployment controls
- Fixed production model version without auto-updates.
- Controlled deployment by qualified personnel.
- Rollback procedure documented and tested.
Pillar 4: Human oversight
Critical GMP decisions remain human approved. AI supports analysis, but final release or rejection authority remains with qualified personnel.
Annex 22 on autonomy: Human-on-the-loop is the compliant architecture for GMP critical decisions.
Pillar 5: Monitoring and continuous assurance
- Periodic performance review and drift monitoring.
- Defined triggers for investigation and revalidation.
Pillar 6: Documentation and audit readiness
- Complete event-level audit trails with version traceability.
- Change control history for every model update.
- Human review evidence for each quality-impacting decision.
6.2 Annex 22 Alignment Mapping
- Training data governance: source and preprocessing metadata tracking (Evidence: model metadata record).
- Documented development methodology: per-version training protocol (Evidence: ML training protocol).
- Validation before deployment: qualification dataset OQ verification (Evidence: OQ qualification report).
- Model version control: semantic versioning and hash control (Evidence: model version register).
- Human oversight: AI recommends, QA approves, QP releases (Evidence: sign-off workflow specification).
- Explainability: confidence and root cause evidence per alert (Evidence: FRS explainability requirements).
- Continuous assurance: monthly performance review process (Evidence: continuous assurance protocol).
- Audit trail completeness: append-only ALCOA+ event log controls (Evidence: audit trail design specification).
7. Human-on-the-Loop Governance
Human-on-the-loop means AI performs analysis and recommendation, but all GMP critical outcomes require explicit human approval.
7.1 Why Human-on-the-Loop Is the Right Architecture
Regulatory reason
EU AI Act oversight requirements and Annex 22 compliance demand actionable human oversight for high-risk AI.
Technical reason
Model error is non-zero in real production. Human review prevents false positives from becoming formal GMP records.
Commercial reason
Human-on-the-loop is the deployable and saleable operating model for pharmaceutical quality organizations in 2026.
7.2 Governance Framework Design
- Operator: views alerts and process context; monitoring-only authority; acknowledge-only rights.
- QA Manager: reviews AI reports and signs deviations; deviation record authority; can dismiss with rationale.
- QA Director: approves escalated decisions; escalated quality authority; can override with documentation.
- Qualified Person (QP): reviews complete record and release status; legal release authority; cannot delegate to AI.
- Validation Engineer: deploys controlled model updates; validation authority; configuration changes via change control.
Decision flow remains controlled from anomaly detection through QA review, QP decision, and immutable record sealing.
7.3 Governance Failure Modes and Mitigations
- Alert fatigue: risk is dismissals without review; mitigation is specificity tuning and monthly rate monitoring.
- Rubber stamping: risk is nominal oversight; mitigation is review-time controls and outlier audits.
- QP bottleneck: risk is release delays; mitigation is optimized sign-off process and delegation policy.
- Model staleness: risk is rising false negatives; mitigation is periodic drift review and retraining triggers.
- Undocumented overrides: risk is weak record defensibility; mitigation is mandatory rationale capture on every dismissal.
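The last mitigation, mandatory rationale capture, can be enforced at the API boundary so an undocumented dismissal never reaches the record. A minimal sketch; the function name, the 10-character minimum, and the log shape are illustrative placeholders, not a specified control.

```python
def dismiss_alert(alert_id: str, user: str, rationale: str, audit_log: list) -> None:
    """Reject a dismissal that lacks a substantive rationale, so every
    override is documented before it can enter the audit trail."""
    if not rationale or len(rationale.strip()) < 10:  # arbitrary minimum for the sketch
        raise ValueError("Dismissal requires a documented rationale.")
    audit_log.append({
        "action": "dismiss",
        "alert_id": alert_id,
        "user": user,
        "rationale": rationale.strip(),
    })
```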
7.4 Demonstrating Governance to Regulators
- Role matrix and decision ownership.
- Complete audit trail and sign-off history.
- Periodic performance reports and drift reviews.
- Model version register and latest change control evidence.
Inspector question: Show a real batch where AI flagged an anomaly and QA dismissed it, including identity, timestamp, and rationale in the audit trail.
Conclusion
Validating AI batch monitoring under GAMP 5 requires probabilistic specifications, explainable outputs, model lifecycle controls, and continuous assurance after deployment.
- Define measurable performance targets before training.
- Build explainability into core architecture.
- Treat model updates as controlled validated changes.
- Design human oversight as a structural requirement.
- Operate continuous monitoring as part of validation.
Teams that operationalize this framework now will be better prepared for inspections, faster deployment cycles, and stronger quality outcomes.
About BatchCortex
BatchCortex is a GMP batch intelligence platform for pharmaceutical manufacturers, combining AI anomaly detection, deviation report drafting, and Part 11 compliant sign-off in an Annex 22 ready architecture.
Learn more on the BatchCortex homepage and review our compliance overview.
batchcortex.com | vilmer@batchcortex.com | Stockholm, Sweden
Ready to validate AI under GAMP 5?
Get early access to BatchCortex and prepare your Annex 22 aligned rollout.
Join the Waitlist