Fraud detection is a solved problem — until you need to explain why a transaction was flagged. Traditional ML models (XGBoost, random forests) are excellent at detecting patterns, but their explanations are incomprehensible to compliance teams, customer support, and regulators. SHAP plots and feature importance charts don't cut it when you need to justify a blocked transaction to a frustrated customer.
At EPAM, I designed an end-to-end fraud detection system that combines an ensemble ML pipeline for detection with LLM-powered explainability for human-readable justifications. The result: fraud rate dropped from 8% to 1.2%, with every flag accompanied by a clear, actionable explanation visible in a Streamlit dashboard.
The Two-System Architecture
The critical design decision: separate detection from explanation. Detection needs to be fast, deterministic, and reliable. Explanation can tolerate higher latency but must be accurate and human-readable. Forcing one system to do both creates unacceptable compromises.
```
Transaction Data
        ↓
[Detection Layer] Ensemble ML Pipeline (~50ms)
  ├── XGBoost (tabular features)
  ├── Rule Engine (hard business rules)
  └── Velocity Checks (time-window aggregates)
        ↓
Fraud Score + Feature Contributions
        ↓
[Explanation Layer] LLM Explainability (~500ms)
        ↓
Human-Readable Explanation
        ↓
Streamlit Dashboard
```
Detection Layer: Ensemble ML Pipeline
The detection layer is a classic ensemble approach, but with careful engineering for production reliability:
- XGBoost core model: Trained on 50+ features including transaction amount, merchant category, time-of-day patterns, device fingerprints, and historical behavior. We chose XGBoost over deep learning because the data is tabular, its feature importances are interpretable, and inference is fast.
- Rule engine layer: Hard business rules that override model predictions. Known fraud patterns (gift card draining, velocity attacks) get rule-based detection because they're 100% precision — no need for probabilistic scoring.
- Velocity checks: Time-window aggregations that catch burst patterns. "5 transactions in 2 minutes from the same card" is a signal the model might miss if it evaluates transactions independently.
```python
# Ensemble scoring
def score_transaction(txn):
    # Probability of the fraud class (class 1) for this single transaction.
    xgb_score = xgb_model.predict_proba([txn.features])[0][1]
    rule_flag = rule_engine.evaluate(txn)
    velocity_flag = velocity_checker.check(txn.card_id, window="5min")

    if rule_flag:
        return 1.0, "rule_match"  # Hard override: known fraud pattern
    if velocity_flag:
        xgb_score = min(xgb_score * 1.5, 1.0)  # Boost score for burst activity
    return xgb_score, "model"
```
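The velocity checker referenced above can be sketched as a sliding-window counter per card. This is an illustrative in-process version; the production system presumably uses a shared store such as Redis, and the class name, thresholds, and signature (the window here is fixed at construction rather than passed per call) are assumptions.

```python
import time
from collections import defaultdict, deque
from typing import Optional

class VelocityChecker:
    """Sliding-window transaction counter per card (illustrative sketch)."""

    def __init__(self, max_txns: int = 5, window_seconds: float = 300.0):
        self.max_txns = max_txns
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # card_id -> recent timestamps

    def check(self, card_id: str, now: Optional[float] = None) -> bool:
        """Record a transaction and return True if the burst threshold is hit."""
        now = time.time() if now is None else now
        events = self._events[card_id]
        events.append(now)
        # Drop timestamps that have fallen out of the window.
        while events and now - events[0] > self.window_seconds:
            events.popleft()
        return len(events) >= self.max_txns
```

This catches the "5 transactions in 2 minutes" burst pattern without the model ever needing to see cross-transaction state.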
Explanation Layer: LLM Explainability
When a transaction is flagged (score > threshold), the explanation layer generates a human-readable justification. The inputs to the LLM:
- Top-5 SHAP features: The most influential features that drove the model's decision, with their direction and magnitude.
- Transaction context: Amount, merchant, time, location, device — the raw facts.
- Historical baseline: What's "normal" for this customer — average transaction size, typical merchants, usual time patterns.
- Rule triggers: Which business rules were activated, if any.
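Assembling the first of these inputs, the top-5 SHAP features, amounts to ranking contributions by absolute magnitude and recording each one's direction. A minimal sketch (the function name and output schema are assumptions for illustration):

```python
import numpy as np

def top_shap_features(feature_names, shap_values, k=5):
    """Return the k features with the largest absolute SHAP contribution,
    each with its direction and magnitude, ready to format into the prompt."""
    order = np.argsort(-np.abs(shap_values))[:k]
    return [
        {
            "feature": feature_names[i],
            "direction": "raises risk" if shap_values[i] > 0 else "lowers risk",
            "contribution": float(shap_values[i]),
        }
        for i in order
    ]
```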
```python
# LLM explanation prompt (simplified)
EXPLAIN_PROMPT = """
Given this flagged transaction and the model's reasoning,
generate a clear, concise explanation for the compliance team.

Transaction: {txn_details}
Customer baseline: {baseline}
Top risk factors: {shap_features}
Rule triggers: {rules}

Write 2-3 sentences explaining WHY this transaction was flagged.
Focus on what's unusual compared to the customer's normal behavior.
"""
```
Example output: "This $2,340 purchase at an electronics retailer in Miami was flagged because the customer's average transaction is $87 and they've never shopped at this merchant before. Additionally, this is the 4th transaction in the last 30 minutes, which is unusual for this account."
Why Not End-to-End LLM?
The obvious question: why not use the LLM for both detection and explanation? Three reasons:
- Latency: Detection must happen in <50ms for real-time scoring. LLM inference takes 500ms+. You can't block a payment for half a second.
- Determinism: The same transaction features must always produce the same fraud score. LLMs are stochastic — even with temperature=0, there's variance across runs. Regulators require consistent, reproducible decisions.
- Cost: Processing every transaction through an LLM would cost orders of magnitude more than XGBoost inference. The LLM is only called for flagged transactions (~2-5% of volume).
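The gating pattern that makes the latency and cost numbers work can be sketched as follows: scoring is synchronous on the hot path, and explanation work is enqueued only for flagged transactions and consumed by a background worker. The threshold value and function names here are assumptions for illustration.

```python
import queue

FLAG_THRESHOLD = 0.8  # assumed; the real threshold is tuned per segment

explanation_queue = queue.Queue()

def handle_transaction(txn, score_fn, threshold=FLAG_THRESHOLD):
    """Synchronous scoring; explanation work is deferred off the hot path."""
    score, source = score_fn(txn)
    if score >= threshold:
        # Only flagged transactions (~2-5% of volume) pay the LLM cost,
        # and the payment decision never waits on the LLM.
        explanation_queue.put((txn, score, source))
    return score

def explanation_worker(explain_fn):
    """Background consumer that calls the LLM for each flagged transaction."""
    while True:
        txn, score, source = explanation_queue.get()
        explain_fn(txn, score, source)  # ~500ms LLM call, off the hot path
        explanation_queue.task_done()
```

In practice `explanation_worker` would run in a thread or separate service; the key property is that a slow or failed LLM call can never block a payment decision.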
The Streamlit Dashboard
The end product is a Streamlit dashboard where the compliance team can:
- Review flagged transactions with LLM-generated explanations alongside raw model outputs.
- Approve or escalate flags with one click, creating an audit trail.
- View trends in fraud patterns — which merchants, time windows, and transaction types are most affected.
- Provide feedback that feeds back into model retraining. False positive flags become negative training examples.
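The approve/escalate flow behind the dashboard's one-click actions can be sketched as a small review store that writes an audit record for every decision and turns rejections (false positives) into labeled retraining examples. This in-memory class is illustrative; the production system presumably persists to a database, and all names here are assumptions.

```python
import datetime

class ReviewStore:
    """In-memory stand-in for the dashboard's backing store (sketch)."""

    def __init__(self):
        self.audit_trail = []  # every analyst action, for regulators
        self.feedback = []     # (features, label) pairs for retraining

    def record_decision(self, txn, decision, analyst):
        """'approve' confirms fraud; 'reject' marks a false positive,
        which becomes a negative training example."""
        assert decision in ("approve", "escalate", "reject")
        self.audit_trail.append({
            "txn_id": txn["id"],
            "decision": decision,
            "analyst": analyst,
            "at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        })
        if decision == "approve":
            self.feedback.append((txn["features"], 1))  # confirmed fraud
        elif decision == "reject":
            self.feedback.append((txn["features"], 0))  # not fraud
```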
Results: 8% → 1.2%
The combined system achieved dramatic fraud rate reduction:
- Fraud rate: 8% → 1.2% (85% reduction)
- False positive rate: Reduced by 40% vs previous rule-only system
- Review time: Compliance team reviews 3x faster with LLM explanations vs raw model outputs
- Detection latency: <50ms for scoring, ~500ms for explanation generation (async, non-blocking)
Lessons from Production
- Explanation quality matters more than model accuracy. A slightly less accurate model with clear explanations gets adopted. A highly accurate model with opaque decisions gets bypassed by compliance teams who don't trust it.
- LLM explanations need guardrails. Without constraints, the LLM occasionally generates explanations that contradict the model's actual reasoning. We validate that the explanation references the actual top SHAP features — not hallucinated reasoning.
- Feedback loops are essential. The compliance team's approve/reject decisions are the highest-quality labels you'll ever get. Feeding these back into model retraining creates a virtuous cycle.
- Rule engine handles known patterns. Don't force the ML model to learn patterns you already know. Hard-coded rules for known fraud signatures are faster, more reliable, and more interpretable.
- Monitor explanation drift. As the LLM model updates (or if you switch providers), explanation quality can silently degrade. Automated evaluation on a held-out set of flagged transactions catches this.
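The guardrail from the second lesson, checking that the LLM's explanation actually references the model's top risk factors, can be sketched as a simple grounding check. The alias table mapping feature names to human-readable phrasings is an assumption for this sketch; a production version would be more robust.

```python
def explanation_is_grounded(explanation, top_features, min_mentions=2):
    """Guardrail: require the LLM text to mention at least `min_mentions`
    of the model's actual top risk factors, matched via aliases
    (the alias table below is a hypothetical example)."""
    aliases = {
        "txn_amount": ["amount", "purchase", "$"],
        "merchant_novelty": ["never shopped", "new merchant", "first time"],
        "velocity_5min": ["transaction in", "burst", "minutes"],
    }
    text = explanation.lower()
    mentioned = sum(
        any(a in text for a in aliases.get(f, [f.lower()]))
        for f in top_features
    )
    return mentioned >= min_mentions
```

Explanations that fail the check are regenerated or fall back to a templated summary of the SHAP features, so hallucinated reasoning never reaches the compliance team.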
"The best fraud detection system isn't the one with the highest precision — it's the one that the compliance team actually uses. LLM explainability bridges the gap between model intelligence and human trust."