At Amazon fulfillment centers, visual inspection is a critical quality gate. Damaged packages reaching customers cause returns, refunds, and trust erosion. Manual inspection doesn't scale — with millions of packages per day, humans catch maybe 60-70% of damaged items. We needed a computer vision system that could classify damage in real-time at conveyor-belt speed.

This case study documents our journey from a basic CNN to a production ResNet deployment — the architecture decisions, accuracy vs latency trade-offs, and the $418K in annual savings that justified the infrastructure investment.

The Problem: Package Damage Classification

The system needed to classify packages into three categories from camera images captured at conveyor belt stations:

  1. Undamaged (continue down the belt)
  2. Minor damage (flagged for human review)
  3. Critical damage (diverted off the conveyor)

Constraints that shaped our architecture decisions:

  1. Real-time inference at conveyor-belt speed, within a 100 ms end-to-end latency budget.
  2. At least 90% recall on critical damage — a missed damaged package reaches a customer.
  3. Deployable on our existing SageMaker infrastructure.

Approach 1: Custom CNN

We started with a straightforward CNN architecture — 5 convolutional blocks with batch normalization, max pooling, and a final classifier head:

Custom CNN Architecture:
  Input (224×224×3)
    → Conv(32, 3×3) → BN → ReLU → MaxPool(2×2)
    → Conv(64, 3×3) → BN → ReLU → MaxPool(2×2)
    → Conv(128, 3×3) → BN → ReLU → MaxPool(2×2)
    → Conv(256, 3×3) → BN → ReLU → MaxPool(2×2)
    → Conv(512, 3×3) → BN → ReLU → GlobalAvgPool
    → FC(256) → Dropout(0.5) → FC(3) → Softmax

Parameters: ~8.2M
Model size: ~33 MB (FP32)
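The block diagram above translates directly into code. This is a minimal sketch in PyTorch (the framework is an assumption — the article doesn't name one), and exact parameter counts depend on details like padding and the classifier width:

```python
import torch
import torch.nn as nn

class DamageCNN(nn.Module):
    """Sketch of the 5-block CNN described above (illustrative)."""
    def __init__(self, num_classes: int = 3):
        super().__init__()

        def block(cin: int, cout: int) -> nn.Sequential:
            # Conv -> BN -> ReLU -> MaxPool, as in the diagram
            return nn.Sequential(
                nn.Conv2d(cin, cout, 3, padding=1),
                nn.BatchNorm2d(cout),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),
            )

        self.features = nn.Sequential(
            block(3, 32), block(32, 64), block(64, 128), block(128, 256),
            # Final conv block ends in global average pooling, not max pool
            nn.Conv2d(256, 512, 3, padding=1),
            nn.BatchNorm2d(512),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512, 256),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes),  # softmax is applied inside the loss
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

model = DamageCNN()
out = model(torch.randn(1, 3, 224, 224))
print(out.shape)  # torch.Size([1, 3])
```

Note the final `Linear` emits raw logits; pairing it with `nn.CrossEntropyLoss` applies the softmax implicitly and is numerically safer than an explicit softmax layer.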

Results

Custom CNN Performance:
  Overall Accuracy:     84.3%
  Critical Damage Recall: 82.1%
  False Positive Rate:    6.8%
  Inference Latency:      12 ms (SageMaker ml.g4dn.xlarge)
  Training Time:          2.5 hours (50K images)

The CNN performed reasonably well but fell short of our 90% critical damage recall target. Analysis of misclassifications revealed the model struggled with subtle damage patterns — small tears partially hidden by tape, slight crushing on edges, and water damage that only discolored one corner. Detecting these requires the model to learn fine-grained texture features that a 5-layer CNN couldn't capture.

Approach 2: ResNet-50 with Transfer Learning

ResNet's skip connections solve the vanishing gradient problem that limits deep plain CNNs. The residual blocks learn corrections to the input rather than full transformations:

ResNet Residual Block:
  Input → [Conv → BN → ReLU → Conv → BN] + Input → ReLU → Output
                                            ↑
                                      skip connection
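The diagram corresponds to a two-conv residual block. A minimal PyTorch sketch (note: ResNet-50 itself uses three-layer bottleneck blocks; the two-conv variant shown here matches the diagram and ResNet-18/34):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two-conv residual block: output = ReLU(F(x) + x)."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The skip connection adds the input back in, so the body only
        # needs to learn a correction F(x) rather than a full transformation.
        return self.relu(self.body(x) + x)

x = torch.randn(1, 64, 56, 56)
y = ResidualBlock(64)(x)  # same shape as x
```

Because gradients flow through the identity path unimpeded, stacking dozens of these blocks remains trainable where an equally deep plain CNN would not be.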

ResNet-50 for our task:
  - Backbone: ImageNet pre-trained ResNet-50 (25.6M params)
  - Replaced final FC layer: FC(2048 → 3)
  - Fine-tuning strategy: freeze first 3 stages, fine-tune stage 4 + head
  - Data augmentation: random crop, flip, rotation, color jitter, brightness

Results

ResNet-50 Performance:
  Overall Accuracy:     92.7%  (+8.4% vs CNN)
  Critical Damage Recall: 94.2%  (+12.1% vs CNN)
  False Positive Rate:    3.1%   (-3.7% vs CNN)
  Inference Latency:      28 ms  (SageMaker ml.g4dn.xlarge)
  Training Time:          4 hours (50K images, fine-tuned)

Per-class breakdown:
  Undamaged:      Precision 0.95 | Recall 0.96
  Minor Damage:   Precision 0.87 | Recall 0.84
  Critical Damage: Precision 0.91 | Recall 0.94

The jump from 82% to 94% critical damage recall was dramatic. Transfer learning was the key — ImageNet pre-training gave the model rich texture and edge features that a model trained only on 50K fulfillment images couldn't learn from scratch.

The Latency vs Accuracy Trade-off

ResNet-50 hit 28ms inference — well within our 100ms budget. But we also evaluated lighter alternatives:

Architecture Comparison:
  Model           | Accuracy | Critical Recall | Latency | Model Size
  Custom CNN      |  84.3%   |     82.1%       |  12 ms  |   33 MB
  ResNet-18       |  89.1%   |     89.8%       |  15 ms  |   45 MB
  ResNet-50       |  92.7%   |     94.2%       |  28 ms  |   98 MB
  ResNet-101      |  93.1%   |     94.6%       |  52 ms  |  171 MB
  EfficientNet-B0 |  91.8%   |     93.5%       |  22 ms  |   21 MB

ResNet-101 only gained 0.4% over ResNet-50 while nearly doubling latency. Not worth it. EfficientNet-B0 was interesting — competitive accuracy with the smallest model size — but ResNet-50's maturity and well-understood deployment on SageMaker tipped the scale.

"In production ML, the right model isn't the one with the highest accuracy on a test set. It's the one that meets your accuracy threshold with acceptable latency, fits your deployment infrastructure, and has a track record of stable inference. ResNet-50 checks all those boxes."

Production Deployment on SageMaker

The deployment architecture:

  1. Image capture: Industrial cameras at conveyor belt stations capture images triggered by proximity sensors.
  2. Preprocessing: Lambda function resizes images to 224×224 and normalizes pixel values. Runs in <5ms.
  3. Inference: SageMaker real-time endpoint with ResNet-50. Auto-scaling based on throughput — 2 instances during peak, 1 during off-hours.
  4. Action: Critical damage predictions trigger a divert signal to the conveyor system. Minor damage flags are queued for human review.
  5. Monitoring: Prediction distribution tracked hourly. If the model starts classifying >10% as critical (normal is ~3%), an alarm fires — likely a data quality issue (dirty camera lens, lighting change).
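The preprocessing step (step 2) amounts to standard ImageNet-style normalization. A minimal NumPy sketch of what the Lambda does after resizing (the exact implementation is an assumption):

```python
import numpy as np

# ImageNet channel statistics, matching the pretrained backbone.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(image: np.ndarray) -> np.ndarray:
    """Takes an HxWx3 uint8 image already resized to 224x224 and
    returns a 1x3x224x224 float32 batch ready for the endpoint."""
    x = image.astype(np.float32) / 255.0          # scale to [0, 1]
    x = (x - IMAGENET_MEAN) / IMAGENET_STD        # per-channel normalize
    return np.transpose(x, (2, 0, 1))[np.newaxis, ...]  # HWC -> NCHW

batch = preprocess(np.zeros((224, 224, 3), dtype=np.uint8))
# batch.shape == (1, 3, 224, 224), dtype float32
```

Using the same mean/std as training is essential: a mismatch here silently shifts every prediction, which is exactly the class of failure the step-5 monitoring is designed to catch.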

INT8 Quantization for Cost Optimization

Post-training quantization (PTQ) reduced the model from FP32 to INT8:

Quantization Impact:
  FP32: 98 MB, 28 ms inference, ml.g4dn.xlarge ($0.526/hr)
  INT8: 25 MB, 18 ms inference, ml.g4dn.xlarge ($0.526/hr)

  Accuracy drop: 92.7% → 92.3% (negligible)
  Latency improvement: 36% faster
  Can run on CPU: ml.c5.xlarge ($0.17/hr) at 45 ms → 68% cost savings

Business Impact: The $418K Story

The roughly $418K in annual savings — driven by fewer damaged packages reaching customers and the returns, refunds, and manual inspection labor they incur — is what justified the infrastructure investment.

Safety Compliance Extension

After the damage detection system proved its value, we extended the same ResNet architecture for truck loading safety compliance. The model classifies whether packages are loaded following safety protocols (stacking patterns, weight distribution, securing methods).

For compliance, we fine-tuned from our damage detection model rather than from ImageNet — the domain similarity (fulfillment center imagery) meant our features transferred better than generic ImageNet features. This reduced training data requirements from 50K to 15K labeled images for comparable accuracy, and manual annotation effort dropped by 18%.

Key Takeaways

  1. Transfer learning is not optional for small datasets. With 50K images, training from scratch topped out at 84%. Pre-trained ResNet hit 93%. The features learned from ImageNet's 1.2M images are remarkably transferable.
  2. Don't over-architect. ResNet-50 from 2015 beat our custom CNN by 8.4%. Sometimes the boring, well-understood model is the right choice.
  3. Quantize for deployment. INT8 quantization saved 68% on serving costs with negligible accuracy loss. This should be the default for classification models.
  4. Build the monitoring before the model. We caught two production issues through monitoring (dirty camera lenses, lighting changes) that would have silently degraded accuracy for weeks.