At Amazon, our ML models were only as good as the data feeding them. The challenge: aggregate and transform 10 billion+ rows from multiple source systems into clean, ML-ready feature tables in Redshift — reliably, cost-efficiently, and fast enough to feed daily model retraining.

This case study covers the PySpark + AWS Glue architecture we built, the optimization patterns that cut job runtime by 60%, and the hard lessons about building data pipelines that don't break on Monday mornings.

The Data Landscape

Our ML systems consumed data from five source systems.

Total: roughly 8-10 billion new rows per month flowing into S3, with historical data going back 3 years. The ML feature tables needed to combine columns from multiple sources, compute rolling aggregations, and be query-optimized for downstream model training.

Architecture: The Three-Zone Pattern

We organized S3 into three logical zones, each serving a different purpose:

S3 Data Lake Architecture:

s3://data-lake-raw/
  └── source=operational/year=2022/month=11/day=15/
  └── source=workforce/year=2022/month=11/day=15/
  └── source=inventory/...
  (Raw, unmodified data from source systems. Immutable archive.)

s3://data-lake-processed/
  └── domain=fulfillment/table=events_cleaned/
  └── domain=workforce/table=shifts_normalized/
  (Cleaned, validated, standardized data. Parquet format.)

s3://data-lake-features/
  └── model=damage_detection/version=v3/
  └── model=workforce_forecast/version=v2/
  (ML-ready feature tables. Optimized for training reads.)

Each zone has:
  - Separate IAM policies (raw is write-once, features need careful access)
  - Separate lifecycle rules (raw retains 3 years, processed 1 year)
  - Separate Glue catalog databases for schema management
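As an illustration of the layout above, the Hive-style partition prefixes for the raw zone can be generated with a small helper (a sketch; the function name and zero-padding convention are my assumptions, not the team's actual code):

```python
from datetime import date

def raw_partition_prefix(source: str, d: date) -> str:
    """Build the Hive-style partition prefix for the raw zone,
    e.g. source=operational/year=2022/month=11/day=15/ (zero-padded
    month/day is an assumption)."""
    return (
        f"source={source}/"
        f"year={d.year}/month={d.month:02d}/day={d.day:02d}/"
    )

prefix = raw_partition_prefix("operational", date(2022, 11, 15))
# -> "source=operational/year=2022/month=11/day=15/"
```

Consistent, predictable prefixes like this are what make the partition pruning discussed later possible: jobs can construct the exact S3 path for a day instead of scanning the bucket.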

AWS Glue: The ETL Engine

We chose AWS Glue over EMR for several practical reasons: no cluster management, automatic scaling, native S3/Redshift integration, and a Glue Data Catalog that served as our unified schema registry.

Job Architecture

Pipeline Structure (daily execution):

Job 1: Ingest & Validate (per source)
  - Read new partitions from raw zone
  - Schema validation (reject rows with missing required fields)
  - Data type coercion and normalization
  - Write to processed zone as Parquet with Snappy compression

Job 2: Feature Engineering (per model)
  - Read relevant tables from processed zone
  - Join across domains (operational + workforce + quality)
  - Compute rolling aggregations (7-day, 30-day, 90-day windows)
  - Write to features zone, partitioned by model and date

Job 3: Load to Redshift (per feature table)
  - COPY from S3 to Redshift staging tables
  - Merge into production tables (upsert pattern)
  - Run data quality checks post-load
  - Notify downstream ML pipelines
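Job 1's reject-rows-with-missing-required-fields rule reduces to a simple predicate. A pure-Python sketch (field names are hypothetical; in the real job this would be a PySpark filter over a DataFrame):

```python
# Hypothetical required fields for an operational event row
REQUIRED_FIELDS = {"event_id", "facility_id", "event_date"}

def validate_row(row: dict) -> bool:
    """A row passes only if every required field is present and non-null."""
    return all(row.get(f) is not None for f in REQUIRED_FIELDS)

rows = [
    {"event_id": 1, "facility_id": "SEA4", "event_date": "2022-11-15"},
    {"event_id": 2, "facility_id": None, "event_date": "2022-11-15"},  # rejected
]
valid = [r for r in rows if validate_row(r)]
invalid = [r for r in rows if not validate_row(r)]
```

In the actual Glue job the same predicate would be expressed as a `filter` on the DataFrame, with the rejected rows written to a side output for the schema-violation log mentioned later.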

PySpark Optimization Patterns

1. Partition Strategy

The single most impactful optimization. Wrong partitioning kills everything — too few partitions cause OOM, too many cause small-file overhead.

# Bad: reading the entire dataset, then filtering down to one day
from pyspark.sql.functions import col

df = spark.read.parquet("s3://raw/operational/")
df_today = df.filter(col("date") == "2022-11-15")

# Good: partition pruning — only the relevant partition is read
df = spark.read.parquet("s3://raw/operational/year=2022/month=11/day=15/")

Impact: over 99% reduction in data scanned (one day's partition, roughly 300MB, instead of the full 6TB)

2. Broadcast Joins

When joining a large table (billions of rows) with a small reference table (thousands of rows), a shuffle join sends data across the cluster needlessly. Broadcasting the small table eliminates the shuffle entirely.

from pyspark.sql.functions import broadcast

# Large table: 3B rows of operational events
events = spark.read.parquet("s3://processed/events/")

# Small table: 500 rows of facility metadata
facilities = spark.read.parquet("s3://processed/facilities/")

# Raise the broadcast threshold before the join executes
# (default is 10MB; we increased it to 50MB for our use case)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "50MB")

# Broadcast join — the small table is replicated to every executor
result = events.join(broadcast(facilities), "facility_id")

3. Avoiding Shuffles in Aggregations

Rolling window aggregations over billions of rows cause massive shuffles. Our optimization: pre-partition by the group key before windowing.

from pyspark.sql.functions import avg
from pyspark.sql.window import Window

# Pre-partition to co-locate data for the window functions
df = df.repartition(200, "facility_id")

# Window operations partitioned on facility_id now avoid cross-partition shuffles
window_spec = Window.partitionBy("facility_id") \
    .orderBy("event_date") \
    .rowsBetween(-6, 0)  # 7-day rolling window (current row plus 6 preceding)

df = df.withColumn("throughput_7d_avg",
    avg("daily_throughput").over(window_spec)
)

4. File Size Optimization

# Problem: Glue's default parallelism creates thousands of tiny files
# 10,000 files × 1MB each = slow reads, high S3 API costs

# Solution: coalesce before writing
df.coalesce(150).write \
    .mode("overwrite") \
    .partitionBy("date") \
    .parquet("s3://processed/events/")

# Target: 128MB-256MB per file (matches Spark's default read partition size)
# For 30GB of daily output: 30GB / ~200MB ≈ 150 files, hence coalesce(150)
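The file-count arithmetic in the comments above can be captured in a tiny helper (illustrative only; the name `coalesce_target` is mine, not from the pipeline):

```python
def coalesce_target(total_bytes: int, target_file_bytes: int = 200 * 1024**2) -> int:
    """Number of output files so each lands near the target size (~200MB)."""
    return max(1, round(total_bytes / target_file_bytes))

# 30GB of daily output at ~200MB per file -> roughly 150 files
n = coalesce_target(30 * 1024**3)
```

A helper like this keeps the file-size target in one place as daily volumes grow, instead of hard-coding a coalesce count per job.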

Redshift Loading Patterns

The COPY + Merge Pattern

-- Step 1: COPY from S3 to staging table (fast, parallel)
COPY staging_features
FROM 's3://features/workforce/date=2022-11-15/'
IAM_ROLE 'arn:aws:iam::role/RedshiftLoadRole'
FORMAT PARQUET;

-- Step 2: Merge into production table (upsert)
BEGIN TRANSACTION;

DELETE FROM production_features
USING staging_features
WHERE production_features.feature_id = staging_features.feature_id
  AND production_features.date = staging_features.date;

INSERT INTO production_features
SELECT * FROM staging_features;

DROP TABLE staging_features;

COMMIT;
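For daily runs, Step 1's COPY statement has to be re-rendered per date partition. A hedged sketch of that templating (the function and parameter names are my assumptions; Parquet COPY needs no delimiter or compression options):

```python
from datetime import date

def build_copy_sql(table: str, bucket_prefix: str, d: date, iam_role: str) -> str:
    """Render the COPY statement for one date partition of a feature table."""
    path = f"{bucket_prefix}/date={d.isoformat()}/"
    return (
        f"COPY {table}\n"
        f"FROM '{path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"FORMAT PARQUET;"
    )

sql = build_copy_sql(
    "staging_features",
    "s3://features/workforce",
    date(2022, 11, 15),
    "arn:aws:iam::role/RedshiftLoadRole",
)
```

The rendered string would then be executed inside the same transaction as the DELETE/INSERT merge, keeping the load idempotent per date.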

Distribution and Sort Keys

Redshift query performance depends heavily on how data is distributed across nodes and sorted within nodes:

CREATE TABLE production_features (
    feature_id    BIGINT        ENCODE az64,
    facility_id   VARCHAR(20)   ENCODE lzo,
    date          DATE          ENCODE delta,
    throughput    FLOAT         ENCODE raw,
    headcount     INT           ENCODE az64,
    -- ... 50+ feature columns
)
DISTSTYLE KEY
DISTKEY (facility_id)           -- co-locates same facility data
SORTKEY (date, facility_id);    -- optimizes range scans on date

-- This cut our ML training query time from 45 min to 12 min:
-- the sort key lets Redshift skip irrelevant blocks entirely

Data Quality: The Unglamorous Necessity

Data quality issues caused more ML model failures than any model bug. Our quality framework:

  1. Schema validation (ingestion): Reject rows missing required fields. Log schema violations with source and timestamp for debugging.
  2. Freshness checks (daily): Alert if a source hasn't delivered data in the expected window. A "fresh" model trained on stale data is worse than no model.
  3. Distribution checks (daily): Monitor feature distributions. If throughput_7d_avg suddenly has 40% nulls, something upstream broke.
  4. Cross-source consistency (weekly): Verify that workforce headcount from HR matches operational headcount from badge scans. Discrepancies indicate data pipeline bugs.

"We spent 40% of our engineering time on data quality. It's not glamorous work, but it's the foundation. A sophisticated ML model trained on bad data produces confidently wrong predictions — the worst possible outcome."
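Check 3 (distribution monitoring) boils down to comparing a column's null rate against a threshold. A minimal sketch, assuming the 40% alert threshold from the example above:

```python
def null_fraction(values: list) -> float:
    """Fraction of values that are null/None; empty input counts as 0."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)

def distribution_alert(values: list, threshold: float = 0.4) -> bool:
    """True when the null rate exceeds the alert threshold."""
    return null_fraction(values) > threshold

# e.g. throughput_7d_avg with 3 of 5 values null -> alert fires
should_alert = distribution_alert([None, None, None, 12.4, 11.9])
```

The production check would compute the null rate per feature column per day in SQL or PySpark; the alerting logic is the same comparison.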

Performance Results

Optimization Impact (daily pipeline, 10B+ row dataset):

Metric                    | Before  | After   | Improvement
Daily pipeline runtime    | 6.2 hrs | 2.4 hrs | 61% faster
Monthly AWS Glue cost     | $4,800  | $1,900  | 60% cheaper
Redshift query time (ML)  | 45 min  | 12 min  | 73% faster
S3 storage cost/month     | $2,100  | $850    | 60% cheaper
Failed job rate           | 12%/wk  | 1%/wk   | 92% more reliable
Data freshness SLA breach | 8/month | 0/month | 100% SLA compliance

Key Takeaways

  1. Partition first, optimize second. Correct partitioning in S3 and Redshift solves 80% of performance problems. Everything else is refinement.
  2. Idempotency is non-negotiable. Every job must be safely re-runnable. Date-partitioned writes with overwrite mode guarantee this.
  3. Measure file sizes. Too many small files kills read performance. Too few large files kills write parallelism. Target 128-256MB per file.
  4. Data quality > pipeline speed. A fast pipeline that produces bad data is worse than a slow pipeline that produces correct data. Build quality checks into every stage.
  5. The three-zone pattern scales. Raw → Processed → Features with clear boundaries, separate access controls, and independent lifecycle policies. It's boring, but it works at 10B+ scale.