Anomaly detection in production business metrics is deceptively hard. The data is noisy, seasonal, and non-stationary. True anomalies are rare (by definition), and the cost of missed detections varies wildly — a missed revenue dip at Amazon could mean millions, while a false alarm at 2 AM costs nothing but sleep.

At Amazon, I built and deployed a real-time anomaly detection system monitoring fulfillment center metrics. This post documents our systematic comparison of Isolation Forest (the classical approach) and Autoencoders (the deep learning approach), what we learned, and why we ended up using both.

The Problem: Business Metric Anomalies

We monitored 50+ metrics across multiple fulfillment centers: throughput rates, processing times, error rates, headcount efficiency, and equipment utilization. An anomaly might be a point anomaly (a sudden throughput spike or drop at a single timestamp), a contextual anomaly (a value that is normal on a weekday afternoon but not on a Sunday night), or a collective anomaly (a slow drift across several correlated metrics at once).

Our legacy system used static thresholds (alert if metric > X or < Y). It generated hundreds of false positives per week because thresholds can't account for seasonality, day-of-week effects, or gradual trend changes.
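To see why static thresholds fail, here is a small synthetic demo (all numbers are illustrative, not our production metrics): a fixed cutoff tuned to weekday traffic fires on every weekend hour, while a simple per-day-of-week baseline stays quiet.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two weeks of hourly throughput with a strong day-of-week cycle
# (weekends run ~40% lower) plus noise -- all values are synthetic.
hours = np.arange(24 * 14)
weekday = (hours // 24) % 7
base = np.where(weekday >= 5, 600.0, 1000.0)
series = base + rng.normal(0, 30, size=hours.size)

# A static threshold tuned to weekday traffic...
static_low = 800.0
false_alarms = np.sum(series < static_low)
# ...fires on essentially every weekend hour, even though nothing is wrong.
print(f"static-threshold alerts over two weeks: {false_alarms}")

# A per-day-of-week baseline (mean +/- 3 sigma) removes those false alarms.
alerts = 0
for d in range(7):
    vals = series[weekday == d]
    mu, sigma = vals.mean(), vals.std()
    alerts += np.sum(np.abs(vals - mu) > 3 * sigma)
print(f"seasonality-aware alerts: {alerts}")
```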

Isolation Forest: The Tree-Based Approach

Isolation Forest works on a beautiful insight: anomalies are easier to isolate than normal points. Build random trees that recursively split the feature space. Anomalies, being different from the majority, get isolated in fewer splits.

from sklearn.ensemble import IsolationForest

model = IsolationForest(
    n_estimators=200,
    contamination=0.01,    # expected fraction of anomalies in the data
    max_samples='auto',
    max_features=1.0,
    random_state=42
)

# Fit on historical data assumed to be ~99% normal (matching contamination)
model.fit(X_train)

# Prediction: -1 = anomaly, 1 = normal
predictions = model.predict(X_new)
anomaly_scores = model.decision_function(X_new)  # continuous scores
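A quick sanity check on synthetic 2-D data (the names below are demo-only, not our production features) shows the isolation behavior in action: the point far outside the training cloud gets flagged and receives the lower score.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# 500 well-behaved points around the origin, plus one obvious outlier query.
X_demo = rng.normal(0, 1, size=(500, 2))
X_query = np.array([[0.1, -0.2],   # typical point
                    [8.0, 8.0]])   # far outside the training cloud

demo_model = IsolationForest(n_estimators=200, contamination=0.01,
                             random_state=42).fit(X_demo)

preds = demo_model.predict(X_query)             # -1 flags the outlier
scores = demo_model.decision_function(X_query)  # outlier gets the lower score
```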

Strengths

  - Fast: roughly 30 seconds to train and sub-millisecond inference in our benchmark.
  - Works out of the box on tabular features, with no distributional assumptions and little tuning.
  - Scales to many features and large datasets thanks to subsampling (max_samples).

Weaknesses

  - Scores each point independently, so it is blind to temporal context: a value that is normal on a weekday can be anomalous on a Sunday.
  - Requires guessing the contamination rate up front.
  - Misses contextual and collective anomalies, such as slow drifts across correlated metrics.

Autoencoders: The Reconstruction Approach

An autoencoder learns to compress (encode) and reconstruct (decode) normal data. The key insight: anomalies will have high reconstruction error because the model has never learned to reconstruct abnormal patterns.

import torch
import torch.nn as nn

class AnomalyAutoencoder(nn.Module):
    def __init__(self, input_dim, hidden_dim=32, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim, latent_dim),
            nn.ReLU()
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim, input_dim)
        )

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded

# Anomaly score = reconstruction error per sample
def anomaly_score(model, x):
    model.eval()  # disable dropout so scores are deterministic
    with torch.no_grad():
        reconstructed = model(x)
        return torch.mean((x - reconstructed) ** 2, dim=1)
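
The class above defines the network and a scoring function but no training loop. A minimal sketch follows; the names train_autoencoder and fit_threshold, and every hyperparameter (epochs, batch size, learning rate, the 99th-percentile cutoff), are illustrative choices, not our production settings.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_autoencoder(model, X_train, epochs=30, batch_size=256, lr=1e-3):
    """Minimize reconstruction MSE on data assumed to be normal."""
    loader = DataLoader(TensorDataset(X_train),
                        batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    model.train()
    for _ in range(epochs):
        for (batch,) in loader:
            optimizer.zero_grad()
            loss_fn(model(batch), batch).backward()
            optimizer.step()
    return model

def fit_threshold(model, X_val, q=0.99):
    """Alert threshold: the q-th quantile of errors on held-out normal data."""
    model.eval()
    with torch.no_grad():
        errors = torch.mean((X_val - model(X_val)) ** 2, dim=1)
    return torch.quantile(errors, q).item()
```

At inference time, any sample whose reconstruction error exceeds the fitted threshold is flagged.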

Strengths

  - Learns nonlinear correlations across metrics, catching contextual and collective anomalies that tree-based methods miss.
  - Produces a continuous, interpretable score (reconstruction error) that can be thresholded per metric.
  - The LSTM variant models temporal structure in windowed data, which is what lets it detect slow drifts.

Weaknesses

  - Expensive: roughly 45 minutes to train and ~8 ms inference in our benchmark, versus seconds and sub-millisecond for Isolation Forest.
  - Needs careful tuning (architecture, latent size, dropout) and clean, mostly-normal training data.
  - If the training window is contaminated, it can learn to reconstruct anomalies too, silently degrading recall.

Head-to-Head: Our Benchmark Results

We benchmarked both approaches on 6 months of labeled production data (850K data points, 1.2% anomaly rate, labeled by domain experts):

Metric              | Isolation Forest | Autoencoder (LSTM) | Hybrid
--------------------|-----------------|--------------------|---------
Precision           |     0.72        |       0.81         |  0.88
Recall              |     0.68        |       0.79         |  0.85
F1 Score            |     0.70        |       0.80         |  0.86
False Positive Rate |     0.08        |       0.05         |  0.03
Avg Detection Delay |     12 min      |       5 min        |  4 min
Inference Latency   |     0.2 ms      |       8 ms         |  9 ms
Training Time       |     30 sec      |       45 min       |  46 min

The Hybrid Approach: Best of Both

Neither model alone was sufficient. Our production system uses a two-stage hybrid ensemble:

  1. Stage 1 — Isolation Forest (fast filter): Runs on every incoming data point at sub-millisecond latency. Catches obvious point anomalies immediately. High recall, moderate precision.
  2. Stage 2 — LSTM Autoencoder (deep analysis): Runs on windowed data (last 1 hour) for points flagged by IF or on a periodic schedule. Catches contextual and collective anomalies. Higher precision, catches slow drifts.
  3. Ensemble scoring: Final anomaly score = weighted combination. Both models agreeing = high-confidence alert (pages on-call). Only one model flagging = medium-confidence alert (notification, no page).
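
The decision logic above can be sketched as a small scoring function. The weights and thresholds here are illustrative placeholders, not our calibrated production values; in practice both scores should be normalized to a common scale (e.g. empirical percentiles) before weighting.

```python
def hybrid_alert(if_score, ae_score,
                 if_thresh=0.0, ae_thresh=0.05, w_if=0.4, w_ae=0.6):
    """Combine both detectors (all thresholds/weights are illustrative).

    if_score: IsolationForest.decision_function output (lower = more anomalous)
    ae_score: autoencoder reconstruction error (higher = more anomalous)
    """
    if_flag = if_score < if_thresh
    ae_flag = ae_score > ae_thresh

    # Flip the IF score so that "higher = more anomalous" before weighting.
    combined = w_if * (-if_score) + w_ae * ae_score

    if if_flag and ae_flag:
        return "page", combined     # high confidence: page on-call
    if if_flag or ae_flag:
        return "notify", combined   # medium confidence: notification, no page
    return "ok", combined
```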

"The hybrid system reduced false positives by 20% over either model alone and improved critical anomaly detection by 30%. The key insight: use Isolation Forest for speed and recall, Autoencoders for depth and precision. Don't make it an either/or choice."

Lessons for Production Anomaly Detection

  1. Start with Isolation Forest. It's fast, works out of the box, and catches 70% of real anomalies. Add deep learning only when you've proven IF's limitations on your specific data.
  2. Invest in labeling. Without labeled anomaly data, you can't properly tune thresholds or compare models. We had domain experts label 3 months of data — costly but essential.
  3. Seasonality is the enemy. Remove daily, weekly, and yearly seasonality before feeding data to anomaly detectors. Otherwise, every Monday morning looks anomalous compared to Sunday.
  4. Alert fatigue kills systems. A system with 50 false alarms per day gets ignored within a week. Optimize for precision first, recall second. A missed anomaly is bad; a team that ignores all alerts is worse.
  5. Retrain continuously. Business metrics evolve. A model trained on 2022 data won't detect 2023 anomalies accurately. We retrain monthly on a rolling 6-month window.
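
On lesson 3, the simplest deseasonalization that helped us is just subtracting a per-(day-of-week, hour-of-day) historical mean; the sketch below uses synthetic data, and fancier decompositions (STL, Prophet-style) work too.

```python
import numpy as np
import pandas as pd

def deseasonalize(ts: pd.Series) -> pd.Series:
    """Remove hour-of-day x day-of-week seasonality from an hourly series.

    A simple seasonal-mean approach: subtract the historical mean for each
    (day-of-week, hour) slot, leaving a residual centered near zero.
    """
    key = ts.index.dayofweek * 24 + ts.index.hour
    seasonal_mean = ts.groupby(key).transform("mean")
    return ts - seasonal_mean

# Four weeks of hourly data with a weekday/weekend pattern (synthetic):
idx = pd.date_range("2023-01-02", periods=24 * 28, freq="h")
rng = np.random.default_rng(1)
base = np.where(idx.dayofweek >= 5, 600.0, 1000.0)
ts = pd.Series(base + rng.normal(0, 10, len(idx)), index=idx)

resid = deseasonalize(ts)
# The raw series swings ~400 between weekdays and weekends; the residual
# stays near zero all week, so Monday mornings no longer look anomalous.
```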