Most ML tutorials end at model.predict(). But in production, the model is maybe 10% of the system. The other 90% is data pipelines, feature engineering, serving infrastructure, monitoring, and all the glue that keeps the thing running at 3 AM when something breaks.

After 7+ years building ML systems at Amazon and EPAM — from 10B+ row data pipelines to real-time inference APIs — here are the system design patterns I keep coming back to.

The ML System Architecture Stack

Every production ML system, regardless of the model, has the same five layers:

  1. Data Layer: Where features are sourced, stored, and versioned
  2. Training Layer: Where models are trained, tuned, and validated
  3. Registry Layer: Where models are versioned, compared, and promoted
  4. Serving Layer: Where models make predictions at low latency
  5. Monitoring Layer: Where you detect when everything goes wrong

Most teams nail the training layer (that's the fun part) and completely neglect monitoring. This is how you end up with a model that worked great in November and is quietly making terrible predictions by February.

Data Pipelines: The Foundation

At Amazon, we built pipelines that processed 10 billion+ rows from S3 through AWS Glue into Redshift. The architecture pattern that survived scale:

S3 (Raw Data)
  → AWS Glue ETL (PySpark)
    → S3 (Processed Parquet)
      → Redshift (Feature Tables)
        → SageMaker (Training)

Key design decisions:
- Partitioned by date for incremental processing
- Schema validation at ingestion (catch bad data early)
- Idempotent jobs (re-runnable without duplication)
- Separate raw/processed/feature zones in S3

The most important principle: idempotent pipelines. If a job fails halfway through, you should be able to re-run it from scratch without creating duplicate data or corrupted state. We achieved this with date-partitioned writes and atomic swap operations in Redshift.
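The staging-table swap can be sketched with stdlib sqlite3 standing in for Redshift (table and column names here are hypothetical; in Redshift the same pattern is a transaction around the drop and rename):

```python
import sqlite3

def publish_features(conn: sqlite3.Connection, rows: list) -> None:
    """Idempotent publish: rebuild a staging table, then swap it in atomically."""
    cur = conn.cursor()
    # Re-runs always start from a clean staging table, so a failed job
    # can be retried from scratch without duplicating data.
    cur.execute("DROP TABLE IF EXISTS features_staging")
    cur.execute("CREATE TABLE features_staging (ds TEXT, value REAL)")
    cur.executemany("INSERT INTO features_staging VALUES (?, ?)", rows)
    # The swap: readers see either the old table or the new one,
    # never a half-loaded state.
    cur.execute("DROP TABLE IF EXISTS features")
    cur.execute("ALTER TABLE features_staging RENAME TO features")
    conn.commit()
```

Because the publish rebuilds from staging every time, running the job twice with the same input leaves the table in the same state as running it once.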

Schema Evolution

Data schemas change. Source teams add columns, rename fields, and change data types. If your pipeline assumes a fixed schema, it will break, guaranteed. We handled this with:
- Explicit column selection (never SELECT *), so new upstream columns are ignored until adopted deliberately
- Schema validation at ingestion, alerting on missing or retyped columns instead of failing silently downstream
- Additive-only contracts with source teams: new fields are fine, but renames and type changes require coordination
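One piece of that is the ingestion-time schema check itself. A minimal sketch, with a hypothetical column-to-type contract:

```python
# Hypothetical ingestion contract: column name -> expected Python type.
EXPECTED_SCHEMA = {"order_id": str, "amount": float, "ds": str}

def check_schema(record: dict, expected: dict = EXPECTED_SCHEMA):
    """Return (missing, extra, retyped) column sets for one record."""
    missing = set(expected) - set(record)
    # New upstream columns: surface them in an alert, but don't fail the run.
    extra = set(record) - set(expected)
    retyped = {
        col for col, typ in expected.items()
        if col in record and not isinstance(record[col], typ)
    }
    return missing, extra, retyped
```

Missing or retyped columns are hard failures worth paging on; extra columns are usually just a heads-up that the source team shipped something new.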

Feature Engineering at Scale

Feature engineering is where most of the real model performance comes from. But feature engineering at scale introduces a unique challenge: training-serving skew.

"Training-serving skew is the #1 silent killer of ML systems. Your model trained on features computed in batch PySpark, but serving computes them in real-time Python. The subtle differences in implementation will degrade your model silently."

The solution: compute features once, serve everywhere. We used a feature store pattern where computed features were stored in a centralized table (Redshift for batch, Redis for real-time), and both training and serving read from the same source.
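A minimal version of "compute once, serve everywhere" is a single feature function that both the batch job and the serving path import, so the logic physically cannot diverge (function and field names here are hypothetical):

```python
from datetime import datetime

def days_since_last_order(now: datetime, last_order: datetime) -> int:
    """The one definition of this feature. Both the batch job and the
    real-time endpoint call this exact function."""
    return (now - last_order).days

# Batch path (sketch): applied row-by-row over a training snapshot.
def batch_features(rows: list, now: datetime) -> list:
    return [days_since_last_order(now, r["last_order"]) for r in rows]

# Serving path (sketch): applied to one live request.
def serving_feature(request: dict, now: datetime) -> int:
    return days_since_last_order(now, request["last_order"])
```

The feature store adds storage and freshness on top of this, but the core skew protection is simply that there is one implementation, not two.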

Model Training and Experimentation

At scale, model training is not a notebook activity. It's a pipeline:

  1. Data snapshot: Pin training data to a specific date range for reproducibility
  2. Feature computation: Generate features from the snapshot (not live data)
  3. Training: Train candidate model with logged hyperparameters
  4. Evaluation: Compare against current production model on holdout set
  5. Registration: If better, register in MLflow with metadata
  6. Approval: Human-in-the-loop review for critical models

We used SageMaker Training Jobs for compute-heavy training with spot instances (70% cost reduction). MLflow tracked every experiment — parameters, metrics, artifacts — so any model could be reproduced months later.
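Independent of MLflow, the reproducibility idea boils down to recording enough to rebuild any run: the pinned data snapshot plus the hyperparameters. A stdlib-only sketch (the field names are hypothetical):

```python
import hashlib
import json

def run_record(snapshot_start: str, snapshot_end: str,
               params: dict, metrics: dict) -> dict:
    """Build an experiment record with a deterministic run ID derived from
    the data snapshot and hyperparameters, so identical inputs always map
    to the same experiment identity."""
    payload = json.dumps(
        {"start": snapshot_start, "end": snapshot_end, "params": params},
        sort_keys=True,
    )
    return {
        "run_id": hashlib.sha256(payload.encode()).hexdigest()[:12],
        "snapshot": (snapshot_start, snapshot_end),
        "params": params,
        "metrics": metrics,
    }
```

Metrics deliberately do not feed the ID: two runs over the same snapshot with the same parameters are the same experiment, whatever they scored.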

Model Serving Patterns

Batch Serving

For use cases that don't need real-time predictions (daily forecasts, weekly recommendations), batch serving is simpler and cheaper. Compute predictions on a schedule, store in a database, and let the application query pre-computed results.

Our workforce forecasting system at Amazon ran weekly batch predictions via Lambda → SageMaker Batch Transform → Redshift → QuickSight dashboard. No real-time inference needed, no latency concerns.

Real-Time Serving

For real-time needs (anomaly detection, damage classification), we used SageMaker Endpoints with autoscaling. Key considerations:
- Latency budget: p99, not the average, is what callers experience
- Autoscaling lag: scale-out takes minutes, so provision headroom for traffic spikes
- Cold starts: keep a minimum instance count warm
- Fallbacks: a timeout plus a sensible default beats a hanging request

Monitoring: The Neglected Layer

Your model will degrade. It's not a question of if, but when. Degradation comes in two forms:

Data Drift

The distribution of input features changes over time. Customer behavior shifts, business conditions change, upstream data pipelines break subtly. We monitored feature distributions using population stability index (PSI) with alerts when any feature's PSI exceeded 0.2.
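PSI over pre-binned distributions is simple enough to sketch in a few lines (a minimal implementation, assuming per-bin fractions that each sum to roughly 1.0):

```python
import math

def psi(expected_fracs, actual_fracs, eps=1e-6):
    """Population Stability Index between two binned distributions:
    sum over bins of (actual - expected) * ln(actual / expected)."""
    total = 0.0
    for e, a in zip(expected_fracs, actual_fracs):
        e, a = max(e, eps), max(a, eps)  # guard against empty bins
        total += (a - e) * math.log(a / e)
    return total
```

A stable feature scores near 0; by the rule of thumb used above, anything over 0.2 warrants an alert.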

Model Drift

Even with stable inputs, the relationship between features and targets can shift. We tracked rolling prediction accuracy (MAPE for forecasting, AUC for classification) on a sliding window and set up automated retraining when performance dropped below threshold.

# Simplified monitoring loop (assumes a higher-is-better metric like AUC;
# flip the comparison for error metrics like MAPE, where lower is better)
for model in production_models:
    recent_predictions = get_predictions(model, last_7_days)
    recent_actuals = get_actuals(model, last_7_days)

    current_metric = compute_metric(recent_predictions, recent_actuals)
    baseline_metric = model.baseline_metric

    if current_metric < baseline_metric * 0.85:  # >15% relative degradation
        trigger_alert(model, current_metric, baseline_metric)
        if auto_retrain_enabled:
            trigger_retraining_pipeline(model)
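For the forecasting case, compute_metric could be a plain MAPE. A hypothetical minimal implementation (lower is better, so the degradation comparison flips):

```python
def mape(predictions, actuals, eps=1e-9):
    """Mean Absolute Percentage Error; lower is better."""
    assert len(predictions) == len(actuals) and actuals
    return sum(
        abs(p - a) / max(abs(a), eps)  # eps guards zero actuals
        for p, a in zip(predictions, actuals)
    ) / len(actuals)
```

For an error metric like this, the alert condition becomes current_metric > baseline_metric * 1.15 rather than the less-than check shown above.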

Cost Optimization

ML infrastructure costs can spiral quickly. Patterns that saved us significant budget:
- Spot instances for training (the 70% reduction mentioned above)
- Batch inference instead of always-on endpoints wherever latency allows
- Autoscaling endpoints down to a small minimum outside peak hours
- Columnar storage (Parquet) with date partitioning to cut scan costs

The Pattern I Keep Coming Back To

After building ML systems at multiple companies and scales, the pattern that works is boringly consistent:

  1. Start with a simple model and a robust pipeline
  2. Get it to production fast (even if accuracy is mediocre)
  3. Instrument everything — monitoring, logging, alerting
  4. Iterate on model quality once the system is stable
  5. Automate retraining when you trust the pipeline
"Shipping a mediocre model in a robust system beats a perfect model in a notebook. You can always improve the model — but you can't improve a system that doesn't exist."

Conclusion

ML system design is infrastructure engineering with a statistical component — not the other way around. The teams that succeed are the ones that invest as heavily in pipelines, monitoring, and serving as they do in model development. The model is important, but the system is what delivers value.

If you're building ML systems and want to discuss architecture patterns, reach out — I love talking about this stuff.