Agentic Process Intelligence · Healthcare Distribution
From Anomaly to Action
How AI agents detect, reason, and act — reliably. A 5-stage pipeline for enterprise agentic process automation with calibrated confidence.
STAGE 1–2
Detect
→
STAGE 3
Reason
→
STAGE 4
Validate
→
STAGE 5
Execute
→
CONTINUOUS
Learn
Key Design Principle
The LLM only touches Stages 3 and 4. It never owns the facts (Stages 1–2) and never directly executes system actions (Stage 5). This separation is what makes the system auditable.
Cases Analyzed
—
order traces
Events Mined
—
raw log entries
Anomalous Cases
—
process deviations
Max Dwell Gap
—
minutes
AI-Generated Executive Headline
Annual Recoverable Value
Conservative
—
—
Upside
—
Pattern Distribution
Stage 1–2: Anomaly Detection & Root Cause
Deterministic process mining over raw event logs. No LLM involved. Every fact in the payload is a deterministic output from log data.
STEP 1
Ingest Events
ERP/WMS emits events with timestamps, resources, quantities. Events are grouped by case ID and sorted chronologically.
STEP 2
Build Process Graph
Extract activity sequences, compute transition counts. The process graph shows how cases actually flow — not how they should.
STEP 3
Conformance Check
Compare each case to the most frequent path. Detect loops, dwell gaps, quantity mismatches. Flag deviations with evidence.
STEP 4
Anomaly Payload
Package all signals as structured JSON — anomaly type, severity, evidence event IDs. This payload is what the LLM will receive.
Critical
By this point, no LLM has been involved. Every fact in the payload is a deterministic output from log data. The LLM receives evidence — it does not discover it.
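The four steps above can be sketched end-to-end in a few lines. This is a minimal illustration, not the production miner: the event rows, activity names, and the 120-minute dwell threshold are all made up for the example.

```python
from collections import Counter, defaultdict

# Hypothetical event rows: (case_id, activity, timestamp_minutes)
events = [
    ("ORD-0001", "Order Created", 0),
    ("ORD-0001", "Pick", 30),
    ("ORD-0001", "Pack", 45),
    ("ORD-0001", "Ship", 60),
    ("ORD-0002", "Order Created", 0),
    ("ORD-0002", "Pick", 30),
    ("ORD-0002", "Pick", 200),   # repeated activity with a large dwell gap
    ("ORD-0002", "Pack", 220),
    ("ORD-0002", "Ship", 240),
]

# Step 1: group by case ID and sort chronologically
cases = defaultdict(list)
for case_id, activity, ts in sorted(events, key=lambda e: (e[0], e[2])):
    cases[case_id].append((activity, ts))

# Step 2: count transitions to build the process graph
transitions = Counter()
for trace in cases.values():
    for (a, _), (b, _) in zip(trace, trace[1:]):
        transitions[(a, b)] += 1

# Step 3: flag deviations (loops, dwell gaps) with evidence
DWELL_THRESHOLD = 120  # minutes; illustrative
anomalies = []
for case_id, trace in cases.items():
    for (a, t1), (b, t2) in zip(trace, trace[1:]):
        if a == b:
            anomalies.append({"case": case_id, "type": "loop", "activity": a})
        if t2 - t1 > DWELL_THRESHOLD:
            anomalies.append({"case": case_id, "type": "dwell_gap",
                              "minutes": t2 - t1})
```

Step 4 is then just serializing `anomalies` (plus severity and evidence event IDs) into the structured JSON payload; nothing in this pipeline consults an LLM.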
Process Flow Diagram
Case-Level Conformance Check
Structured Payload Sent to LLM
This is exactly what the LLM agent receives — a pre-computed, schema-validated JSON object. No raw event rows. Only analytical signals.
Most Frequent Path (Happy Path)
The "happy path" is not a prescribed or correct sequence — it is the most frequently observed path in the data. It represents how the majority of cases actually flow through the system, derived empirically from event log frequencies.
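A minimal sketch of how the happy path falls out of variant frequencies. The traces here are hypothetical; the point is that the happy path is computed, not configured.

```python
from collections import Counter

# Hypothetical case traces (activity sequences), already mined from the log
traces = {
    "ORD-0001": ("Order Created", "Pick", "Pack", "Ship"),
    "ORD-0002": ("Order Created", "Pick", "Pack", "Ship"),
    "ORD-0003": ("Order Created", "Pick", "Pick", "Pack", "Ship"),
}

# A "variant" is a distinct activity sequence; the happy path is simply
# the variant observed most often -- an empirical fact, not a prescription.
variant_counts = Counter(traces.values())
happy_path, freq = variant_counts.most_common(1)[0]
coverage = freq / len(traces)
```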
All Transitions (sorted by frequency)
Variant Explorer
Explore process variants — select one or more to see their flow and compare deviations from the most frequent path.
0% cases covered
Variant
Count
Coverage
Pattern
Selected variant flow
Select a variant to view its process flow
Stage 3: LLM Reasoning
The LLM receives the structured anomaly payload and produces schema-validated JSON using function calling — not free text.
Without Function Calling
"Based on the following events, determine the cause and recommend an action."
Model returns free text. Any action string is possible. No schema enforcement. No auditability. The model can hallucinate actions that don't exist in your system.
With Function Calling
The model must return a schema-validated JSON object: the action field is an enum, and rationale_ids must reference real event IDs.
Why This Matters
Function calling enforces the action space at the API contract level. The model cannot return an action outside the enum — the API call fails structurally before any downstream system is touched.
Expected Output Schema
{
  "findings": [{
    "pattern_id": "ghost_pick | cold_chain | dea_rework | credit_cascade | otif_fail",
    "title": "short executive-facing title (max 8 words)",
    "affected_cases": ["ORD-XXXX"],
    "root_cause": "operational explanation (2-3 sentences)",
    "downstream_consequence": "business and compliance impact",
    "process_intelligence_signal": "specific PI signal that reveals this",
    "recommended_action": "one concrete, actionable fix",
    "urgency": "critical | high | medium"
  }],
  "executive_summary": {
    "headline": "one sentence, dollar-impact framing",
    "key_insight": "the single most important non-obvious finding"
  }
}
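The primary enforcement happens in the provider's function-calling schema, but the same contract can be re-checked on the receiving side before anything proceeds. A minimal sketch, assuming a hypothetical action enum and a known-event-ID set:

```python
# Assumed action enum and evidence check; names are illustrative, not the
# document's actual action set.
ALLOWED_ACTIONS = {"hold_shipment", "issue_credit_memo", "open_qa_ticket"}

def validate_finding(finding: dict, known_event_ids: set) -> list[str]:
    """Reject any output whose action or cited evidence falls outside the contract."""
    errors = []
    if finding.get("recommended_action") not in ALLOWED_ACTIONS:
        errors.append("action outside enum")
    bad_ids = [e for e in finding.get("rationale_ids", [])
               if e not in known_event_ids]
    if bad_ids:
        errors.append(f"unknown event ids: {bad_ids}")
    return errors
```

Anything returned by `validate_finding` fails the call structurally, before any downstream system is touched.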
Run the Analysis
Send the structured PI payload to the LLM agent. The agent reasons over anomaly signals and returns structured findings in JSON.
Findings will appear here after running the analysis.
Stage 4: Confidence Calibration & Routing
An LLM's self-reported confidence is just another generated token. We replace it with a calibrated score derived from verifiable, external signals.
The Confidence Problem
A model can be 94% confident and completely wrong. Confidence without calibration is worse than no confidence — it suppresses human review on exactly the cases that need it.
The Miscalibration Gap
What the LLM says | What the data shows | Gap
Self-reported: 0.91 | Actual accuracy: 0.63 | -0.28 (dangerous)
Self-reported: 0.72 | Actual accuracy: 0.71 | -0.01 (calibrated)
Self-reported: 0.55 | Actual accuracy: 0.80 | +0.25 (under-confident)
Four Independent Calibration Checks
CHECK 1
Faithfulness
RAGAS / TruLens: is each claim in the LLM's rationale supported by the provided event data? If the LLM says the invoice was posted before GR but the log shows the opposite, score drops sharply.
CHECK 2
Re-ranker
Cross-encoder model scores how relevant each cited event ID is to the anomaly type. Catches hallucinated but syntactically valid rationale_ids that aren't causally related.
CHECK 3
Sampling
Run the same case 5 times at temperature 0.7. Measure semantic entropy over action choices. A model that flips between actions is less trustworthy than one that's consistent.
CHECK 4
Meta-Model
XGBoost classifier trained on resolved cases. Inputs: LLM confidence, faithfulness, relevance, entropy, structural features. Output: calibrated probability that the action is correct.
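Check 3's consistency signal can be sketched as entropy over the sampled action choices. This version scores exact action strings; production systems typically cluster semantically equivalent outputs first.

```python
import math
from collections import Counter

def action_entropy(sampled_actions: list[str]) -> float:
    """Shannon entropy (bits) over actions chosen across repeated samples.
    0.0 means the model is perfectly consistent; higher means it flips."""
    counts = Counter(sampled_actions)
    n = len(sampled_actions)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Illustrative: the same case run 5 times at temperature 0.7
consistent = action_entropy(["hold_shipment"] * 5)
flippy = action_entropy(["hold_shipment", "issue_credit_memo",
                         "hold_shipment", "open_qa_ticket",
                         "issue_credit_memo"])
```

`consistent` comes out as 0.0 bits, while `flippy` is well above 1 bit: a strong down-weighting signal for the meta-model.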
The Key Insight
You cannot calibrate an LLM's confidence from inside the LLM. You calibrate it from the outside, using evidence the LLM cannot fabricate. The meta-model's output probability replaces the LLM's self-reported score.
Three-Way Routing
> 0.85
Auto-Execute
Action passed to deterministic execution layer. Full audit trail logged.
0.60 – 0.85
Human-in-the-Loop
Case presented to analyst with evidence, rationale, and confidence breakdown. One-click approve or override.
< 0.60
Escalation
Routed to a senior analyst. LLM output is shown as a suggestion only, not a recommendation.
Dollar-Value Override
Regardless of confidence, transactions above a defined threshold always require human approval. Non-negotiable for compliance.
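The routing rule, including the dollar-value override, is a few lines of deterministic code. The $10,000 threshold below is illustrative; the real value is a compliance decision.

```python
def route(calibrated_confidence: float, amount_usd: float,
          dollar_threshold: float = 10_000.0) -> str:
    """Three-way routing driven by calibrated confidence, with a hard
    dollar-value override that ignores the score entirely."""
    if amount_usd > dollar_threshold:
        return "human_in_the_loop"   # non-negotiable, regardless of confidence
    if calibrated_confidence > 0.85:
        return "auto_execute"
    if calibrated_confidence >= 0.60:
        return "human_in_the_loop"
    return "escalation"
```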
Routing Simulation
After running the LLM analysis (Stage 3), each finding is scored and routed. This simulation shows how calibrated confidence drives the routing decision.
Run the LLM analysis in Stage 3 first to see routing simulation.
Stage 5: Bounded Execution
When the routing decision is 'auto-execute', the action intent passes to a deterministic orchestration layer — not directly to the ERP. This is the last line of defence.
GATE 1
Precondition Check
Business rules run independently of the LLM. "A credit memo cannot be issued if the original invoice is already paid." Enforced in code, not in a prompt.
GATE 2
Reversibility Flag
Irreversible actions (permanent blocks, large write-offs, supplier blacklisting) are always flagged and require second human confirmation — regardless of confidence.
GATE 3
Execute
Deterministic API call to the target system (ERP, WMS). The LLM's recommendation has already been validated. This is a direct system operation, not a prompt.
GATE 4
Audit Log
Every action — auto-executed, analyst-approved, or overridden — is logged with: timestamp, action, case_id, confidence score, who approved, outcome.
The LLM's Role Ended at Stage 3
From here, a deterministic orchestration layer validates and executes. Every action is pre-checked, flagged for reversibility, and logged. The LLM's recommendation is ignored if preconditions are not met.
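A minimal sketch of the four gates in order. The rule set, action names, and case fields are all hypothetical; the point is that every branch is plain code and every path lands in the audit log.

```python
from datetime import datetime, timezone

# Gate 2's irreversible-action list; assumed names for illustration
IRREVERSIBLE_ACTIONS = {"permanent_block", "supplier_blacklist", "large_write_off"}

def execute_action(action: str, case: dict, audit_log: list) -> str:
    """Run the gates in order; the LLM's recommendation is only an input."""
    # Gate 1: precondition check, enforced in code and independent of the LLM
    if action == "issue_credit_memo" and case.get("invoice_paid"):
        outcome = "rejected_precondition"
    # Gate 2: irreversible actions require a second human confirmation
    elif action in IRREVERSIBLE_ACTIONS and not case.get("second_approval"):
        outcome = "held_for_second_approval"
    else:
        # Gate 3: deterministic API call to the target system would go here
        outcome = "executed"
    # Gate 4: every path, including rejections, is logged
    audit_log.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "action": action,
        "case_id": case["case_id"],
        "confidence": case.get("confidence"),
        "outcome": outcome,
    })
    return outcome
```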
Simulated Audit Trail
Each finding from the LLM analysis flows through the execution pipeline. Below shows the audit trail for each action.
Run the LLM analysis in Stage 3 first to see the audit trail.
Feedback Loop: The System That Improves Itself
Every resolved case — auto-resolved, analyst-approved, or overridden — is ground truth. This data continuously improves every component.
LOOP 1
Case Resolved
Action taken (by agent or analyst) is recorded with outcome: correct, overridden, or escalated.
LOOP 2
Ground Truth Label
Was the LLM's recommended action correct? This becomes a training row for the meta-model.
LOOP 3
Meta-Model Retrain
XGBoost classifier updated monthly on new labelled cases. Calibration thresholds reviewed and adjusted.
If the override rate rises above 5%, routing thresholds tighten; if it drops below 1%, they can relax.
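The threshold-adjustment rule can be sketched as a small deterministic function; the step size and bounds here are illustrative.

```python
def adjust_thresholds(override_rate: float, auto_threshold: float) -> float:
    """Move the auto-execute threshold based on the observed override rate.
    Step size (0.02) and bounds (0.80-0.95) are illustrative choices."""
    if override_rate > 0.05:
        auto_threshold = min(0.95, auto_threshold + 0.02)   # tighten
    elif override_rate < 0.01:
        auto_threshold = max(0.80, auto_threshold - 0.02)   # relax
    return auto_threshold
```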
What Separates an Agentic Pipeline from a Chatbot
The system has memory, learns from outcomes, and self-calibrates. The goal is not a system that is always right. The goal is a system that knows when it is likely to be wrong — and routes those cases to a human before acting.
Simulated Improvement Over Time
Model Accuracy
87% → 93%
after 3 monthly retrains
False Escalation Rate
12% → 5%
fewer unnecessary human reviews
Auto-Resolve Rate
41% → 68%
more cases handled autonomously
Override Rate
8% → 2.1%
analyst corrections declining
Key Takeaways
The Problem | The Engineering Response
LLM confidence is self-reported | Replace with a meta-model trained on ground truth outcomes
Action space is open-ended | Enforce via function calling schema, not prompting
Rationale can hallucinate | Validate rationale_ids against the event log before acting
High-value actions need humans | Dollar-value overrides are non-negotiable for audit
System improves over time | Every resolved case is training data; close the loop
ROI Opportunity
Quantified value at risk across all detected process failure patterns. Low / high scenario modeling.
Assumptions
Pattern-Level Breakdown
Pattern
Primary Driver
Annual Low
Annual High
Raw Event Log
The source of truth — flat, unlabeled, no annotations. Process Intelligence derives all findings from this data.