At Amazon fulfillment centers, visual inspection is a critical quality gate. Damaged packages reaching customers cause returns, refunds, and trust erosion. Manual inspection doesn't scale — with millions of packages per day, humans catch maybe 60-70% of damaged items. We needed a computer vision system that could classify damage in real-time at conveyor-belt speed.
This case study documents our journey from a basic CNN to a production ResNet deployment — the architecture decisions, accuracy vs latency trade-offs, and the $418K in annual savings that justified the infrastructure investment.
The Problem: Package Damage Classification
The system needed to classify packages into three categories from camera images captured at conveyor belt stations:
- Undamaged: Normal package, proceed through fulfillment
- Minor damage: Cosmetic damage (dents, scuffs) — flag for secondary review
- Critical damage: Structural damage (crushed, torn, wet) — divert immediately
Constraints that shaped our architecture decisions:
- Latency: <100ms per image (conveyor belt speed doesn't wait for slow models)
- Accuracy: >90% on critical damage (missed critical damage = customer impact)
- False positive tolerance: <5% (too many false diversions disrupt throughput)
- Data: ~50K labeled images (expensive to annotate fulfillment center images)
Approach 1: Custom CNN
We started with a straightforward CNN architecture — 5 convolutional blocks with batch normalization, max pooling, and a final classifier head:
Custom CNN Architecture:
Input (224×224×3)
→ Conv(32, 3×3) → BN → ReLU → MaxPool(2×2)
→ Conv(64, 3×3) → BN → ReLU → MaxPool(2×2)
→ Conv(128, 3×3) → BN → ReLU → MaxPool(2×2)
→ Conv(256, 3×3) → BN → ReLU → MaxPool(2×2)
→ Conv(512, 3×3) → BN → ReLU → GlobalAvgPool
→ FC(256) → Dropout(0.5) → FC(3) → Softmax
Parameters: ~8.2M
Model size: ~33 MB (FP32)
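The listing above can be sketched in PyTorch roughly as follows. This is an illustrative sketch, not the production code; the quoted ~8.2M parameter count depends on head details not fully specified in the listing.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # Conv -> BN -> ReLU -> MaxPool, as in the listing above
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class DamageCNN(nn.Module):
    def __init__(self, n_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(3, 32),
            conv_block(32, 64),
            conv_block(64, 128),
            conv_block(128, 256),
            # final block uses global average pooling instead of max pool
            nn.Conv2d(256, 512, kernel_size=3, padding=1),
            nn.BatchNorm2d(512),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512, 256),
            nn.Dropout(0.5),
            nn.Linear(256, n_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = DamageCNN().eval()
with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))  # softmax applied at inference time
```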
Results
Custom CNN Performance:
Overall Accuracy: 84.3%
Critical Damage Recall: 82.1%
False Positive Rate: 6.8%
Inference Latency: 12 ms (SageMaker ml.g4dn.xlarge)
Training Time: 2.5 hours (50K images)
The CNN performed reasonably but fell short of our 90% critical damage recall target. Analysis of misclassifications revealed the model struggled with subtle damage patterns — small tears partially hidden by tape, slight crushing on edges, and water damage that only discolored one corner. These require the model to learn fine-grained texture features that a 5-layer CNN couldn't capture.
Approach 2: ResNet-50 with Transfer Learning
ResNet's skip connections solve the vanishing gradient problem that limits deep plain CNNs. The residual blocks learn corrections to the input rather than full transformations:
ResNet Residual Block:
Input → [Conv → BN → ReLU → Conv → BN] + Input → ReLU → Output
                                         ↑
                                skip connection (identity)
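In code, a residual block looks like the following sketch. Note that ResNet-50 itself uses bottleneck blocks (1×1, 3×3, 1×1 convolutions); this shows the simpler two-conv variant from the diagram, which is what ResNet-18/34 use.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: the branch learns a correction F(x),
    and the skip connection adds the input back: out = ReLU(F(x) + x)."""

    def __init__(self, channels):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.branch(x) + x)  # identity skip connection

block = ResidualBlock(64).eval()
x = torch.randn(2, 64, 56, 56)
with torch.no_grad():
    y = block(x)  # same shape as the input, by construction
```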
ResNet-50 for our task:
- Backbone: ImageNet pre-trained ResNet-50 (25.6M params)
- Replaced final FC layer: FC(2048 → 3)
- Fine-tuning strategy: freeze first 3 stages, fine-tune stage 4 + head
- Data augmentation: random crop, flip, rotation, color jitter, brightness
Results
ResNet-50 Performance:
Overall Accuracy: 92.7% (+8.4 pts vs CNN)
Critical Damage Recall: 94.2% (+12.1 pts vs CNN)
False Positive Rate: 3.1% (-3.7 pts vs CNN)
Inference Latency: 28 ms (SageMaker ml.g4dn.xlarge)
Training Time: 4 hours (50K images, fine-tuned)
Per-class breakdown:
Undamaged: Precision 0.95 | Recall 0.96
Minor Damage: Precision 0.87 | Recall 0.84
Critical Damage: Precision 0.91 | Recall 0.94
The jump from 82% to 94% critical damage recall was dramatic. Transfer learning was the key — ImageNet pre-training gave the model rich texture and edge features that a model trained only on 50K fulfillment images couldn't learn from scratch.
The Latency vs Accuracy Trade-off
ResNet-50 hit 28ms inference — well within our 100ms budget. But we also evaluated lighter alternatives:
Architecture Comparison:
Model | Accuracy | Critical Recall | Latency | Model Size
Custom CNN | 84.3% | 82.1% | 12 ms | 33 MB
ResNet-18 | 89.1% | 89.8% | 15 ms | 45 MB
ResNet-50 | 92.7% | 94.2% | 28 ms | 98 MB
ResNet-101 | 93.1% | 94.6% | 52 ms | 171 MB
EfficientNet-B0 | 91.8% | 93.5% | 22 ms | 21 MB
ResNet-101 only gained 0.4% over ResNet-50 while nearly doubling latency. Not worth it. EfficientNet-B0 was interesting — competitive accuracy with the smallest model size — but ResNet-50's maturity and well-understood deployment on SageMaker tipped the scale.
"In production ML, the right model isn't the one with the highest accuracy on a test set. It's the one that meets your accuracy threshold with acceptable latency, fits your deployment infrastructure, and has a track record of stable inference. ResNet-50 checks all those boxes."
Production Deployment on SageMaker
The deployment architecture:
- Image capture: Industrial cameras at conveyor belt stations capture images triggered by proximity sensors.
- Preprocessing: Lambda function resizes images to 224×224 and normalizes pixel values. Runs in <5ms.
- Inference: SageMaker real-time endpoint with ResNet-50. Auto-scaling based on throughput — 2 instances during peak, 1 during off-hours.
- Action: Critical damage predictions trigger a divert signal to the conveyor system. Minor damage flags are queued for human review.
- Monitoring: Prediction distribution tracked hourly. If the model starts classifying >10% as critical (normal is ~3%), an alarm fires — likely a data quality issue (dirty camera lens, lighting change).
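The distribution check in the monitoring step is simple to implement. A sketch using the thresholds above (the alarm plumbing, e.g. CloudWatch, is omitted):

```python
from collections import Counter

CRITICAL_ALARM_THRESHOLD = 0.10  # normal critical rate is ~3%

def critical_rate_alarm(predictions, threshold=CRITICAL_ALARM_THRESHOLD):
    """Return True if the hourly share of 'critical' predictions is
    anomalously high -- usually a data-quality issue (dirty lens,
    lighting change) rather than a real surge in damaged packages."""
    if not predictions:
        return False
    counts = Counter(predictions)
    rate = counts["critical"] / len(predictions)
    return rate > threshold

# 5% critical: within normal variation, no alarm
ok = critical_rate_alarm(["undamaged"] * 90 + ["minor"] * 5 + ["critical"] * 5)
# 20% critical: fires the alarm
bad = critical_rate_alarm(["critical"] * 20 + ["undamaged"] * 80)
```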
INT8 Quantization for Cost Optimization
Post-training quantization (PTQ) reduced the model from FP32 to INT8:
Quantization Impact:
FP32: 98 MB, 28 ms inference, ml.g4dn.xlarge ($0.526/hr)
INT8: 25 MB, 18 ms inference, ml.g4dn.xlarge ($0.526/hr)
Accuracy drop: 92.7% → 92.3% (negligible)
Latency improvement: 36% faster
Can run on CPU: ml.c5.xlarge ($0.17/hr) at 45 ms → 68% cost savings
Business Impact: The $418K Story
The cost impact breakdown that justified the project:
- Reduced damage-related returns: $285K/year — catching critical damage before shipping eliminated 65% of damage-related returns.
- Reduced manual inspection labor: $98K/year — automated classification reduced the manual inspection headcount by 3 FTE equivalents.
- Reduced re-shipping costs: $35K/year — packages diverted at the fulfillment center cost far less than returns from customers.
- Infrastructure cost: -$52K/year — SageMaker endpoints, S3 storage, and Lambda execution.
- Net annual savings: $366K ($418K in gross savings, less the $52K infrastructure cost)
Safety Compliance Extension
After the damage detection system proved its value, we extended the same ResNet architecture for truck loading safety compliance. The model classifies whether packages are loaded following safety protocols (stacking patterns, weight distribution, securing methods).
For compliance, we fine-tuned from our damage detection model rather than from ImageNet — the domain similarity (fulfillment center imagery) meant our features transferred better than generic ImageNet features. This reduced training data requirements from 50K to 15K labeled images for comparable accuracy, and manual annotation effort dropped by 18%.
Key Takeaways
- Transfer learning is not optional for small datasets. With 50K images, training from scratch topped out at 84%. Pre-trained ResNet hit 93%. The features learned from ImageNet's 1.2M images are remarkably transferable.
- Don't over-architect. ResNet-50 from 2015 beat our custom CNN by 8.4%. Sometimes the boring, well-understood model is the right choice.
- Quantize for deployment. INT8 quantization saved 68% on serving costs with negligible accuracy loss. This should be the default for classification models.
- Build the monitoring before the model. We caught two production issues through monitoring (dirty camera lenses, lighting changes) that would have silently degraded accuracy for weeks.