Anomaly detection in production business metrics is deceptively hard. The data is noisy, seasonal, and non-stationary. True anomalies are rare (by definition), and the cost of missed detections varies wildly — a missed revenue dip at Amazon could mean millions, while a false alarm at 2 AM costs nothing but sleep.
At Amazon, I built and deployed a real-time anomaly detection system monitoring fulfillment center metrics. This post documents our systematic comparison of Isolation Forest (the classical approach) and Autoencoders (the deep learning approach), what we learned, and why we ended up using both.
The Problem: Business Metric Anomalies
We monitored 50+ metrics across multiple fulfillment centers: throughput rates, processing times, error rates, headcount efficiency, and equipment utilization. An anomaly might be:
- Point anomaly: A single metric value that's wildly out of range (throughput drops 50% in one hour)
- Contextual anomaly: A value that's normal in absolute terms but unusual given the context (high throughput at 3 AM when the facility should be idle)
- Collective anomaly: Multiple metrics deviating together in a pattern that's individually unremarkable but collectively suspicious
Our legacy system used static thresholds (alert if metric > X or < Y). It generated hundreds of false positives per week because thresholds can't account for seasonality, day-of-week effects, or gradual trend changes.
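To make that failure mode concrete, here's a toy sketch (synthetic numbers, not our data) of a static threshold firing on perfectly normal daily seasonality:

```python
import math

# Hypothetical hourly throughput with a daily cycle: busy days, idle nights.
hourly = [1000 + 800 * math.sin(2 * math.pi * h / 24) for h in range(24 * 7)]

# A static band tuned to look "reasonable" against the overall range...
low, high = 400, 1600

# ...flags the ordinary nightly trough and daily peak, every single day.
false_alarms = sum(1 for v in hourly if v < low or v > high)
print(false_alarms)  # 70 alerts in one quiet, anomaly-free week
```

Any fixed band wide enough to silence these alarms would also swallow real anomalies, which is exactly the trade-off that pushed us toward model-based detection.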
Isolation Forest: The Tree-Based Approach
Isolation Forest rests on an elegant insight: anomalies are easier to isolate than normal points. Build random trees that recursively split the feature space; anomalies, being few and different from the majority, get isolated in fewer splits.
```python
from sklearn.ensemble import IsolationForest

model = IsolationForest(
    n_estimators=200,
    contamination=0.01,  # expected fraction of anomalies
    max_samples='auto',
    max_features=1.0,
    random_state=42
)

# Training on (presumed) normal data
model.fit(X_train)

# Prediction: -1 = anomaly, 1 = normal
predictions = model.predict(X_new)
anomaly_scores = model.decision_function(X_new)  # continuous scores; lower = more anomalous
```
Strengths
- Fast training and inference: Trains in seconds on millions of rows. Inference is sub-millisecond. Critical for real-time monitoring.
- No assumption on data distribution: Unlike statistical methods (z-score, Gaussian mixtures), Isolation Forest works on any distribution shape.
- Handles high-dimensional data: Performs well even with 50+ features without dimensionality reduction.
- Interpretable scores: The anomaly score has a natural interpretation — shorter average path length = more anomalous.
Weaknesses
- Blind to temporal patterns: It treats each observation independently. A metric slowly drifting 2% per day won't trigger until it crosses the anomaly boundary — by which time the problem has been building for weeks.
- Contamination parameter sensitivity: You need to estimate what percentage of your data is anomalous. Get this wrong and you either miss anomalies or generate floods of false positives.
- Struggles with correlated features: When anomalies manifest as unusual feature correlations (metric A is normal, metric B is normal, but A and B moving in opposite directions is anomalous), IF often misses them.
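This weakness is easy to reproduce on synthetic data. In the sketch below (made-up metrics, illustrative numbers), a point that breaks a strong A-B correlation typically scores as less anomalous than a plain marginal extreme, even though both should alarm:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Two normally-correlated metrics: B tracks A very closely.
a = rng.normal(100, 10, size=5000)
b = a + rng.normal(0, 1, size=5000)
X_train = np.column_stack([a, b])

model = IsolationForest(n_estimators=200, random_state=42).fit(X_train)

# Each coordinate alone is typical, but A high while B low never
# occurs in training -- a correlation-breaking anomaly.
correlation_break = np.array([[115.0, 85.0]])
marginal_extreme = np.array([[200.0, 200.0]])

print(model.decision_function(correlation_break))
print(model.decision_function(marginal_extreme))
```

With axis-aligned random splits, the correlation-breaking point lives inside both marginal ranges, so it takes many splits to isolate and gets a much milder score than the marginal extreme.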
Autoencoders: The Reconstruction Approach
An autoencoder learns to compress (encode) and reconstruct (decode) normal data. The key insight: anomalies will have high reconstruction error because the model has never learned to reconstruct abnormal patterns.
```python
import torch
import torch.nn as nn

class AnomalyAutoencoder(nn.Module):
    def __init__(self, input_dim, hidden_dim=32, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim, latent_dim),
            nn.ReLU()
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim, input_dim)
        )

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded

# Anomaly score = reconstruction error per sample
def anomaly_score(model, x):
    model.eval()  # disable dropout when scoring, or errors are inflated
    with torch.no_grad():
        reconstructed = model(x)
    return torch.mean((x - reconstructed) ** 2, dim=1)
```
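For completeness, here's a minimal training-loop sketch. The data is random noise and the architecture is a compact stand-in, so this demonstrates the mechanics only; in practice you train on your cleaned normal data:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Compact stand-in for the encoder/decoder pair above.
input_dim = 16
model = nn.Sequential(
    nn.Linear(input_dim, 32), nn.ReLU(),
    nn.Linear(32, 8), nn.ReLU(),
    nn.Linear(8, 32), nn.ReLU(),
    nn.Linear(32, input_dim),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Train to reconstruct (presumed) normal data.
X_train = torch.randn(1024, input_dim)
for epoch in range(20):
    optimizer.zero_grad()
    loss = loss_fn(model(X_train), X_train)
    loss.backward()
    optimizer.step()

# Per-sample anomaly score = mean squared reconstruction error.
model.eval()
with torch.no_grad():
    scores = torch.mean((X_train - model(X_train)) ** 2, dim=1)
print(scores.shape)
```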
Strengths
- Captures complex correlations: The autoencoder learns the normal relationships between all features simultaneously. Unusual correlations produce high reconstruction error even when individual features are within range.
- Temporal awareness (with LSTM): By using LSTM or Transformer encoder layers, the autoencoder can learn temporal patterns — detecting slow drifts and sequence-level anomalies that Isolation Forest misses.
- Flexible reconstruction targets: You can weight features differently in the loss function — giving higher penalty to business-critical metrics and lower penalty to noisy ones.
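As a sketch of the LSTM variant (layer sizes are illustrative, not our production configuration): the encoder compresses a window to its final hidden state, and the decoder unrolls that state back into a sequence:

```python
import torch
import torch.nn as nn

class LSTMAutoencoder(nn.Module):
    def __init__(self, n_features: int, hidden: int = 16):
        super().__init__()
        self.encoder = nn.LSTM(n_features, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_features)

    def forward(self, x):  # x: (batch, seq_len, n_features)
        _, (h, _) = self.encoder(x)  # compress the window to its final state
        # Repeat the latent state across time and decode it back to a sequence.
        z = h[-1].unsqueeze(1).repeat(1, x.size(1), 1)
        decoded, _ = self.decoder(z)
        return self.out(decoded)

model = LSTMAutoencoder(n_features=8)
window = torch.randn(4, 60, 8)  # 4 windows of 60 time steps
recon = model(window)
print(recon.shape)
```

The anomaly score is the same reconstruction error as before, now computed over a whole window, which is what lets it catch slow drifts.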
Weaknesses
- Training complexity: Requires careful hyperparameter tuning (architecture, learning rate, latent dimension, dropout). Underfitting misses anomalies; overfitting reconstructs anomalies perfectly, defeating the purpose.
- Threshold selection: The reconstruction error is a continuous score. Choosing the right threshold requires labeled validation data or statistical methods (percentile-based). This is the hardest part in practice.
- Training data quality: If the training data contains undetected anomalies, the autoencoder learns to reconstruct them as normal. Data cleaning before training is critical.
- Slower inference: A forward pass through a neural network is 100-1000x slower than Isolation Forest. Still fast enough for real-time (5-10ms), but matters at very high throughput.
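The percentile-based thresholding mentioned above, the simplest of the statistical options, can be sketched like this (the error distribution and the 99.5th percentile are illustrative choices, not fixed recommendations):

```python
import numpy as np

rng = np.random.default_rng(42)

# Reconstruction errors on a clean validation set (synthetic stand-in).
val_errors = rng.gamma(shape=2.0, scale=0.05, size=10_000)

# Flag anything worse than 99.5% of validation errors.
# The percentile is a tuning knob, traded off against alert volume.
threshold = np.percentile(val_errors, 99.5)

new_errors = np.array([0.03, 0.12, 0.95])
flags = new_errors > threshold
print(threshold, flags)
```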
Head-to-Head: Our Benchmark Results
We benchmarked both approaches on 6 months of labeled production data (850K data points, 1.2% anomaly rate, labeled by domain experts):
Metric              | Isolation Forest | Autoencoder (LSTM) | Hybrid
--------------------|------------------|--------------------|-------
Precision           | 0.72             | 0.81               | 0.88
Recall              | 0.68             | 0.79               | 0.85
F1 Score            | 0.70             | 0.80               | 0.86
False Positive Rate | 0.08             | 0.05               | 0.03
Avg Detection Delay | 12 min           | 5 min              | 4 min
Inference Latency   | 0.2 ms           | 8 ms               | 9 ms
Training Time       | 30 sec           | 45 min             | 46 min
The Hybrid Approach: Best of Both
Neither model alone was sufficient. Our production system uses a two-stage hybrid ensemble:
- Stage 1 — Isolation Forest (fast filter): Runs on every incoming data point at sub-millisecond latency. Catches obvious point anomalies immediately. High recall, moderate precision.
- Stage 2 — LSTM Autoencoder (deep analysis): Runs on windowed data (last 1 hour) for points flagged by IF or on a periodic schedule. Catches contextual and collective anomalies. Higher precision, catches slow drifts.
- Ensemble scoring: Final anomaly score = weighted combination. Both models agreeing = high-confidence alert (pages on-call). Only one model flagging = medium-confidence alert (notification, no page).
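A sketch of the combination logic (the thresholds and the page/notify split here are illustrative, not our production values):

```python
def ensemble_alert(if_score: float, ae_score: float,
                   if_threshold: float = 0.0, ae_threshold: float = 0.5) -> str:
    # sklearn's decision_function convention: negative = anomalous.
    if_flag = if_score < if_threshold
    # Autoencoder convention: reconstruction error above threshold = anomalous.
    ae_flag = ae_score > ae_threshold
    if if_flag and ae_flag:
        return "page"    # high confidence: both models agree
    if if_flag or ae_flag:
        return "notify"  # medium confidence: only one model flags
    return "ok"

print(ensemble_alert(-0.2, 0.9))  # both flag -> "page"
print(ensemble_alert(0.1, 0.9))   # only the autoencoder flags -> "notify"
```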
"The hybrid system reduced false positives by 20% over either model alone and improved critical anomaly detection by 30%. The key insight: use Isolation Forest for speed and recall, Autoencoders for depth and precision. Don't make it an either/or choice."
Lessons for Production Anomaly Detection
- Start with Isolation Forest. It's fast, works out of the box, and catches 70% of real anomalies. Add deep learning only when you've proven IF's limitations on your specific data.
- Invest in labeling. Without labeled anomaly data, you can't properly tune thresholds or compare models. We had domain experts label 3 months of data — costly but essential.
- Seasonality is the enemy. Remove daily, weekly, and yearly seasonality before feeding data to anomaly detectors. Otherwise, every Monday morning looks anomalous compared to Sunday.
- Alert fatigue kills systems. A system with 50 false alarms per day gets ignored within a week. Optimize for precision first, recall second. A missed anomaly is bad; a team that ignores all alerts is worse.
- Retrain continuously. Business metrics evolve. A model trained on 2022 data won't detect 2023 anomalies accurately. We retrain monthly on a rolling 6-month window.
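The seasonality point deserves a sketch. A simple, surprisingly effective baseline is to subtract a (day-of-week, hour-of-day) mean profile estimated from history, and feed only the residual to the detector (synthetic data below):

```python
import numpy as np

rng = np.random.default_rng(1)
hours = np.arange(24 * 28)  # four weeks of hourly data

# Synthetic metric: daily cycle + weekly cycle + noise.
daily = 500 * np.sin(2 * np.pi * hours / 24)
weekly = 200 * np.sin(2 * np.pi * hours / (24 * 7))
metric = 1000 + daily + weekly + rng.normal(0, 30, size=hours.size)

# Deseasonalize: subtract the mean profile for each
# (day-of-week, hour-of-day) slot, estimated from history.
slot = (hours // 24 % 7) * 24 + hours % 24
profile = np.array([metric[slot == s].mean() for s in range(24 * 7)])
residual = metric - profile[slot]

# The residual is what the anomaly detector should see.
print(residual.std(), metric.std())
```

On this toy series the residual's spread collapses to roughly the noise level, so a Monday-morning ramp no longer looks anomalous next to a quiet Sunday.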