At Amazon, our ML models were only as good as the data feeding them. The challenge: aggregate and transform 10 billion+ rows from multiple source systems into clean, ML-ready feature tables in Redshift — reliably, cost-efficiently, and fast enough to feed daily model retraining.
This case study covers the PySpark + AWS Glue architecture we built, the optimization patterns that cut job runtime by 60%, and the hard lessons about building data pipelines that don't break on Monday mornings.
The Data Landscape
Our ML systems consumed data from five source systems:
- Operational events: Package scans, conveyor activity, equipment status — ~3B rows/month, streaming via Kinesis
- Workforce data: Shift schedules, attendance, productivity metrics — ~500M rows/month from internal HR systems
- Inventory data: SKU movements, bin occupancy, putaway/pick events — ~4B rows/month
- Quality metrics: Damage reports, compliance inspections, customer returns — ~200M rows/month
- External data: Weather, holiday calendars, demand forecasts — ~50M rows/month
Total: roughly 8-10 billion new rows per month flowing into S3, with historical data going back 3 years. The ML feature tables needed to combine columns from multiple sources, compute rolling aggregations, and be query-optimized for downstream model training.
Architecture: The Three-Zone Pattern
We organized S3 into three logical zones, each serving a different purpose:
S3 Data Lake Architecture:
s3://data-lake-raw/
├── source=operational/year=2022/month=11/day=15/
├── source=workforce/year=2022/month=11/day=15/
└── source=inventory/...
(Raw, unmodified data from source systems. Immutable archive.)
s3://data-lake-processed/
├── domain=fulfillment/table=events_cleaned/
└── domain=workforce/table=shifts_normalized/
(Cleaned, validated, standardized data. Parquet format.)
s3://data-lake-features/
├── model=damage_detection/version=v3/
└── model=workforce_forecast/version=v2/
(ML-ready feature tables. Optimized for training reads.)
Each zone has:
- Separate IAM policies (raw is write-once, features need careful access)
- Separate lifecycle rules (raw retains 3 years, processed 1 year)
- Separate Glue catalog databases for schema management
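The zone layout above can be sketched as a small path-builder. The bucket names match the listing; the helper functions themselves are illustrative, not production code:

```python
from datetime import date

# Illustrative helpers mirroring the three-zone layout above.
RAW = "s3://data-lake-raw"
PROCESSED = "s3://data-lake-processed"
FEATURES = "s3://data-lake-features"

def raw_partition(source: str, day: date) -> str:
    """Raw zone: Hive-style partitions by source and ingest date."""
    return (f"{RAW}/source={source}/"
            f"year={day.year}/month={day.month:02d}/day={day.day:02d}/")

def processed_table(domain: str, table: str) -> str:
    """Processed zone: organized by business domain and table."""
    return f"{PROCESSED}/domain={domain}/table={table}/"

def feature_table(model: str, version: str) -> str:
    """Features zone: organized by model and feature-set version."""
    return f"{FEATURES}/model={model}/version={version}/"

print(raw_partition("operational", date(2022, 11, 15)))
```

Keeping the path conventions in one place means a Glue job can never write a raw-zone path with a processed-zone layout by accident.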
AWS Glue: The ETL Engine
We chose AWS Glue over EMR for several practical reasons: no cluster management, automatic scaling, native S3/Redshift integration, and a Glue Data Catalog that served as our unified schema registry.
Job Architecture
Pipeline Structure (daily execution):
Job 1: Ingest & Validate (per source)
- Read new partitions from raw zone
- Schema validation (reject rows with missing required fields)
- Data type coercion and normalization
- Write to processed zone as Parquet with Snappy compression
Job 2: Feature Engineering (per model)
- Read relevant tables from processed zone
- Join across domains (operational + workforce + quality)
- Compute rolling aggregations (7-day, 30-day, 90-day windows)
- Write to features zone, partitioned by model and date
Job 3: Load to Redshift (per feature table)
- COPY from S3 to Redshift staging tables
- Merge into production tables (upsert pattern)
- Run data quality checks post-load
- Notify downstream ML pipelines
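Job 1's schema validation step boils down to a per-source required-field predicate. A minimal sketch, assuming illustrative field names (the PySpark calls in the comments show where the predicate plugs in):

```python
# Required fields per source — names here are illustrative, not the
# production schema registry.
REQUIRED_FIELDS = {
    "operational": ["event_id", "facility_id", "event_timestamp"],
    "workforce": ["employee_id", "shift_date", "facility_id"],
}

def valid_row_predicate(source: str) -> str:
    """Build a SQL predicate accepting only rows with all required fields."""
    checks = [f"{c} IS NOT NULL" for c in REQUIRED_FIELDS[source]]
    return " AND ".join(checks)

# In the Glue job, the predicate splits the DataFrame in two:
#   good = df.filter(valid_row_predicate("operational"))
#   bad = df.filter(f"NOT ({valid_row_predicate('operational')})")
#   bad.write.mode("append").parquet(reject_path)  # keep rejects for debugging

print(valid_row_predicate("operational"))
```

Writing rejects out rather than silently dropping them is what makes the "log schema violations with source and timestamp" requirement cheap to satisfy.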
PySpark Optimization Patterns
1. Partition Strategy
The single most impactful optimization. Wrong partitioning kills everything — too few partitions cause OOM, too many cause small-file overhead.
from pyspark.sql.functions import col

# Bad: reading the entire dataset, then filtering down to one day
df = spark.read.parquet("s3://raw/operational/")
df_today = df.filter(col("date") == "2022-11-15")
# Good: partition pruning — only reads the relevant partition
df = spark.read.parquet("s3://raw/operational/year=2022/month=11/day=15/")
Impact: the daily job reads ~300MB instead of scanning the full 6TB dataset — a >99% reduction in data scanned.
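One caveat with reading a partition path directly: Spark drops the year/month/day partition columns from the resulting DataFrame. A sketch of the workaround, using Spark's standard basePath option (paths as in the example above):

```python
from datetime import date

def day_partition_path(base: str, day: date) -> str:
    """Build the Hive-style daily partition path read by the 'good' example."""
    return f"{base}/year={day.year}/month={day.month:02d}/day={day.day:02d}/"

path = day_partition_path("s3://raw/operational", date(2022, 11, 15))

# Reading `path` directly omits the year/month/day columns from the
# DataFrame. Supplying basePath keeps them while still pruning to one
# partition:
#   df = spark.read.option("basePath", "s3://raw/operational/").parquet(path)
```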
2. Broadcast Joins
When joining a large table (billions of rows) with a small reference table (thousands of rows), a shuffle join sends data across the cluster needlessly. Broadcasting the small table eliminates the shuffle entirely.
from pyspark.sql.functions import broadcast
# Large table: 3B rows of operational events
events = spark.read.parquet("s3://processed/events/")
# Small table: 500 rows of facility metadata
facilities = spark.read.parquet("s3://processed/facilities/")
# Broadcast join — small table replicated to all executors.
# broadcast() forces the hint explicitly; the threshold below controls
# when Spark broadcasts automatically, even without the hint.
result = events.join(broadcast(facilities), "facility_id")
# Default threshold: 10MB. We raised it to 50MB for our use case.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "50m")
3. Avoiding Shuffles in Aggregations
Rolling window aggregations over billions of rows cause massive shuffles. Our optimization: pre-partition by the group key before windowing.
from pyspark.sql import Window
from pyspark.sql.functions import avg

# Pre-partition to co-locate data for window functions
df = df.repartition(200, "facility_id")
# Window operations keyed on facility_id can now reuse this layout
window_spec = Window.partitionBy("facility_id") \
    .orderBy("event_date") \
    .rowsBetween(-6, 0)  # 7 rows = 7-day window, one row per facility per day
df = df.withColumn("throughput_7d_avg",
                   avg("daily_throughput").over(window_spec))
4. File Size Optimization
# Problem: Glue's default parallelism creates thousands of tiny files
# 10,000 files × 1MB each = slow reads, high S3 API costs
# Solution: coalesce before writing
df.coalesce(100).write \
.mode("overwrite") \
.partitionBy("date") \
.parquet("s3://processed/events/")
# Target: 128MB-256MB per file (matches Spark's default read partition size)
# For 30GB daily output: 30GB / 200MB = ~150 files (use coalesce(150))
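The sizing arithmetic in those comments can be made explicit. A minimal sketch, assuming a ~200MB-per-file target (the midpoint of the 128-256MB range):

```python
def coalesce_target(estimated_bytes: int,
                    target_file_bytes: int = 200 * 1024**2) -> int:
    """Number of output files that lands near the target file size."""
    return max(1, round(estimated_bytes / target_file_bytes))

# 30GB daily output -> ~150 files, as in the comment above
n = coalesce_target(30 * 1024**3)
# df.coalesce(n).write...
```

Computing the target from an estimated output size (rather than hard-coding it) keeps file sizes stable as daily volume grows.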
Redshift Loading Patterns
The COPY + Merge Pattern
-- Step 1: COPY from S3 to staging table (fast, parallel)
COPY staging_features
FROM 's3://features/workforce/date=2022-11-15/'
IAM_ROLE 'arn:aws:iam::role/RedshiftLoadRole'
FORMAT PARQUET;
-- Step 2: Merge into production table (upsert)
BEGIN TRANSACTION;
DELETE FROM production_features
USING staging_features
WHERE production_features.feature_id = staging_features.feature_id
AND production_features.date = staging_features.date;
INSERT INTO production_features
SELECT * FROM staging_features;
DROP TABLE staging_features;
COMMIT;
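Driving this pattern from orchestration code is mostly statement assembly. A hedged sketch — the table names come from the SQL above, but the key columns and the helper itself are illustrative; in the real job these statements execute over a Redshift connection:

```python
def merge_statements(prod: str, staging: str, keys: list[str]) -> list[str]:
    """Build the delete-then-insert upsert, wrapped in a transaction."""
    match = " AND ".join(f"{prod}.{k} = {staging}.{k}" for k in keys)
    return [
        "BEGIN TRANSACTION;",
        f"DELETE FROM {prod} USING {staging} WHERE {match};",
        f"INSERT INTO {prod} SELECT * FROM {staging};",
        f"DROP TABLE {staging};",
        "COMMIT;",
    ]

for stmt in merge_statements("production_features", "staging_features",
                             ["feature_id", "date"]):
    print(stmt)
```

Generating the statements from the key list keeps the upsert logic in one place when new feature tables are added.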
Distribution and Sort Keys
Redshift query performance depends heavily on how data is distributed across nodes and sorted within nodes:
CREATE TABLE production_features (
feature_id BIGINT ENCODE az64,
facility_id VARCHAR(20) ENCODE lzo,
date DATE ENCODE delta,
throughput FLOAT ENCODE raw,
headcount INT ENCODE az64,
-- ... 50+ feature columns
)
DISTSTYLE KEY
DISTKEY (facility_id) -- co-locates same facility data
SORTKEY (date, facility_id); -- optimizes range scans on date
-- Cut our ML training query time from 45 min to 12 min: with date leading
-- the sort key, Redshift's zone maps skip irrelevant blocks entirely.
Data Quality: The Unglamorous Necessity
Data quality issues caused more ML model failures than any model bug. Our quality framework:
- Schema validation (ingestion): Reject rows missing required fields. Log schema violations with source and timestamp for debugging.
- Freshness checks (daily): Alert if a source hasn't delivered data in the expected window. A "fresh" model trained on stale data is worse than no model.
- Distribution checks (daily): Monitor feature distributions. If throughput_7d_avg suddenly has 40% nulls, something upstream broke.
- Cross-source consistency (weekly): Verify that workforce headcount from HR matches operational headcount from badge scans. Discrepancies indicate data pipeline bugs.
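The distribution check reduces to a per-feature null-rate threshold. A minimal sketch — the 5% threshold is illustrative, and in production this ran as a PySpark aggregation rather than over plain lists:

```python
def null_rate(values: list) -> float:
    """Fraction of missing values; empty input counts as fully missing."""
    return sum(v is None for v in values) / len(values) if values else 1.0

def distribution_alert(values: list, max_null_rate: float = 0.05) -> bool:
    """True when the feature's null rate breaches the threshold."""
    return null_rate(values) > max_null_rate

# e.g. throughput_7d_avg suddenly 40%+ null -> alert fires
print(distribution_alert([1.0, None, None, 2.0, None]))
```

Treating an empty batch as 100% null is deliberate: a source that delivered nothing should trip the same alert as one that delivered garbage.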
"We spent 40% of our engineering time on data quality. It's not glamorous work, but it's the foundation. A sophisticated ML model trained on bad data produces confidently wrong predictions — the worst possible outcome."
Performance Results
Optimization Impact (daily pipeline, 10B+ row dataset):
| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Daily pipeline runtime | 6.2 hrs | 2.4 hrs | 61% faster |
| Monthly AWS Glue cost | $4,800 | $1,900 | 60% cheaper |
| Redshift query time (ML) | 45 min | 12 min | 73% faster |
| S3 storage cost/month | $2,100 | $850 | 60% cheaper |
| Failed job rate | 12%/wk | 1%/wk | 92% fewer failures |
| Data freshness SLA breaches | 8/month | 0/month | 100% SLA compliance |
Key Takeaways
- Partition first, optimize second. Correct partitioning in S3 and Redshift solves 80% of performance problems. Everything else is refinement.
- Idempotency is non-negotiable. Every job must be safely re-runnable. Date-partitioned writes with overwrite mode guarantee this.
- Measure file sizes. Too many small files kills read performance. Too few large files kills write parallelism. Target 128-256MB per file.
- Data quality > pipeline speed. A fast pipeline that produces bad data is worse than a slow pipeline that produces correct data. Build quality checks into every stage.
- The three-zone pattern scales. Raw → Processed → Features with clear boundaries, separate access controls, and independent lifecycle policies. It's boring, but it works at 10B+ scale.
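The idempotency takeaway maps to one Spark setting plus date-partitioned overwrite writes. A sketch of the pattern — the config key is standard Spark, while the path is illustrative and `spark`/`df` come from the surrounding job:

```python
# Dynamic partition overwrite: re-running a day's job replaces only the
# partitions present in the new data, leaving other dates untouched.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

(df.write
   .mode("overwrite")
   .partitionBy("date")
   .parquet("s3://processed/events/"))
```

Without the dynamic mode, `overwrite` would wipe the whole table path on every run; with it, a Monday-morning re-run of Sunday's job is safe by construction.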