Everyone shows their successes. Almost nobody shows what failed and why. This is the story of a recommender system that looked perfect in my Jupyter notebook — and completely fell apart the day we turned it on in production.
The Setup
I was building a personalized offer recommendation engine for a global QSR (Quick Service Restaurant) chain. The goal was simple: show each loyalty user the offers they're most likely to redeem, instead of blasting everyone with the same promotions.
I had 8 months of redemption history, user behavioral features, and offer metadata. Classic recommendation problem. I was confident.
Spoiler: my confidence was the first problem.
What I Assumed
Assumption #1: Collaborative Filtering Would Be Enough
I started with pure collaborative filtering — matrix factorization using user-item interaction data. If user A and user B redeemed similar offers, recommend user A's unredeemed offers to user B. Classic. Textbook.
My offline metrics were solid: precision@3 = 0.31, recall@10 = 0.42. I was ready to ship.
Assumption #2: The Data Was Clean Enough
I assumed the interaction matrix was reliable. A user redeeming an offer = positive signal. A user NOT redeeming an offer = negative signal. Simple binary classification.
Assumption #3: One Model Fits All Markets
We operated in multiple countries. I trained a single global model because "more data is always better." Germany, Switzerland, and other markets all pooled together.
What Broke
Week 1 — Launch Day
Failure #1: New offers got zero recommendations
The restaurant chain launched new promotional offers every 1-2 weeks. Pure collaborative filtering needs interaction history to work. New offers had zero interactions → zero scores → never surfaced to any user.
The ops team uploaded 6 new offers on launch day. All 6 were invisible. The model was recommending stale, expired-adjacent offers because those had the most historical interactions.
Impact
100% of new offers got zero impressions in the first week. The marketing team escalated within 48 hours.
Failure #2: Implicit feedback ≠ explicit ratings
Here's what I didn't account for: the absence of a click doesn't mean the user dislikes the item. It might mean they never saw it.
My interaction matrix treated all missing values as negative signals. But users only see 3-5 offers per session out of 40+ available. 90% of "negative" signals were actually missing data, not dislike.
This created a vicious cycle: popular offers got more impressions → more interactions → higher scores → more impressions. New and niche offers starved. The model was essentially a popularity ranker with extra steps.
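A back-of-the-envelope check, using only the session numbers above, shows the scale of the problem (a sketch, not production code):

```python
# Toy numbers from above: ~40 offers live, users see 3-5 per session.
n_offers = 40
shown_per_session = 4

# For a user with a single session, the share of unredeemed offers
# that were never even displayed:
never_shown = 1 - shown_per_session / n_offers
print(f"{never_shown:.0%} of 'negative' signals were never impressions")  # 90%
```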
Week 3 — Market Collapse
Failure #3: The global model was wrong for every market
Germany and Switzerland have fundamentally different food preferences, discount sensitivities, and seasonal patterns. When I trained one model on pooled data, it averaged out the signal and performed poorly everywhere.
A/B test results from week 3: the personalized model performed WORSE than random assignment in the Swiss market. Worse than random. That's the kind of result that makes you question your career.
The Numbers That Hurt
Germany: +3% redemption lift vs control (barely significant)
Switzerland: -4% redemption vs control (worse than random)
New offers: 0 impressions in first 7 days
Popularity bias: Top 5 offers got 78% of all recommendations
What I Changed
I didn't throw everything away. I stepped back, diagnosed each failure mode, and rebuilt the system piece by piece. Here's exactly what changed:
Fix #1: Hybrid model with feature-sum embeddings (LightFM)
The Solution
Replaced pure collaborative filtering with LightFM's hybrid approach. The key insight: each user and each item is represented as the sum of its feature embeddings, not by a single ID embedding.
A new offer with features [BURGER, DISCOUNT_30, COMBO] immediately gets a meaningful embedding vector — the sum of the learned embeddings for each feature. No interaction history required.
```
score(u, i) = sigmoid( q_u · p_i + b_u + b_i )

# q_u = sum of user u's feature embeddings
# p_i = sum of item i's feature embeddings
# b_u, b_i = user and item bias terms

# New offers instantly get a representation
# from their categorical features
```
I also added a cold-start fallback: offers with zero interactions get their identity features set to the average of interacted items' features. This prevents new items from getting degenerate zero-vector representations.
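Here's a minimal sketch of the hybrid setup in LightFM. The offer IDs and feature tags are hypothetical stand-ins for the real metadata table:

```python
import numpy as np
from lightfm import LightFM
from lightfm.data import Dataset

# Hypothetical offers; in production these rows come from the
# offer metadata table.
offers = {
    "OFFER_117": ["BURGER", "DISCOUNT_30", "COMBO"],  # brand new, zero history
    "OFFER_042": ["WRAP", "DISCOUNT_20"],
}

dataset = Dataset()
dataset.fit(
    users=["u1", "u2"],
    items=offers.keys(),
    item_features={f for feats in offers.values() for f in feats},
)
interactions, _ = dataset.build_interactions([("u1", "OFFER_042")])
item_features = dataset.build_item_features(offers.items())

model = LightFM(loss="warp", no_components=128)
model.fit(interactions, item_features=item_features, epochs=8)

# OFFER_117 has no interaction history, but its summed feature
# embeddings still yield a meaningful score for every user.
scores = model.predict(0, np.arange(2), item_features=item_features)
```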
Fix #2: WARP loss instead of BPR
The Solution
Switched from BPR (Bayesian Personalized Ranking) to WARP (Weighted Approximate-Rank Pairwise) loss. WARP specifically optimizes for top-K ranking quality by sampling negative items and focusing gradient updates on ranking violations.
Crucially, WARP handles the implicit feedback problem better — it doesn't treat all missing interactions as equally negative. The number of sampling attempts needed before finding a violation is itself informative: WARP uses it to estimate the positive item's rank and scale the gradient update accordingly.
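In LightFM the change is literally one constructor argument. A minimal sketch of the offline comparison, assuming `train`, `test`, and `item_features` were already built as in the previous example:

```python
from lightfm import LightFM
from lightfm.evaluation import precision_at_k

# train / test: scipy sparse interaction matrices from Dataset.build_interactions
for loss in ("bpr", "warp"):
    model = LightFM(loss=loss, no_components=128, random_state=42)
    model.fit(train, item_features=item_features, epochs=8, num_threads=4)
    p3 = precision_at_k(
        model, test, train_interactions=train,
        k=3, item_features=item_features,
    ).mean()
    print(f"{loss}: precision@3 = {p3:.3f}")
```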
Fix #3: Per-market models with YAML configs
The Solution
Killed the global model. Each market now has its own YAML config defining hyperparameters, feature sets, and training windows:
```yaml
# Germany config
model:
  training_data_start_date: "2025-01-01"
  offer_categorical_features:
    - MENU_SIZE
    - PROTEIN
    - DISCOUNT_TYPE
  number_of_predictions: 3

lightfm_model:
  item_alpha: 5e-4
  user_alpha: 1e-5
  no_components: 128
  epochs: 8
```
Germany and Switzerland ended up needing different features, different embedding dimensions, and different training windows. The "more data is better" assumption was wrong — more relevant data is better.
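Loading a market config then becomes trivial. A sketch, with file paths and key names mirroring the example config above:

```python
from pathlib import Path

import yaml
from lightfm import LightFM

def build_market_model(market: str) -> LightFM:
    cfg = yaml.safe_load(Path(f"configs/{market}.yaml").read_text())
    hp = cfg["lightfm_model"]
    return LightFM(
        loss="warp",
        no_components=hp["no_components"],
        # PyYAML parses a bare "5e-4" as a string, so cast explicitly.
        item_alpha=float(hp["item_alpha"]),
        user_alpha=float(hp["user_alpha"]),
    )

germany_model = build_market_model("germany")
switzerland_model = build_market_model("switzerland")
```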
Fix #4: A/B testing every assignment cycle
The Solution
Built-in test/control splits in every single assignment cycle. TEST group gets personalized offers from the model. CONTROL group gets random or popularity-based assignments. No more "trust the model" — measure everything, every cycle.
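A deterministic hash keeps the split stable within a cycle and reshuffles it across cycles. The helper below is illustrative, not the production code:

```python
import hashlib

def assign_group(user_id: str, cycle_id: str, test_share: float = 0.5) -> str:
    """Deterministic test/control assignment per cycle.
    Hashing (cycle_id, user_id) keeps a user's group stable within a
    cycle but lets it reshuffle on the next one."""
    digest = hashlib.sha256(f"{cycle_id}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    return "TEST" if bucket < test_share else "CONTROL"

assert assign_group("user_42", "2025-W10") in ("TEST", "CONTROL")
```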
The Results (After Fixes)
After 4 weeks of rebuilding and 2 assignment cycles of A/B testing:
- +12% redemption rate lift (test vs control) — consistently across cycles
- New offers surfaced within 24 hours of upload — no more cold-start blindness
- Per-market lift: Germany +14%, Switzerland +9% (from -4% to +9%!)
- Popularity bias reduced: Top 5 offers went from 78% to 34% of recommendations
What I Learned (The Hard Way)
- Your notebook metrics are lying to you. Precision@3 = 0.31 in offline evaluation meant nothing when the model couldn't handle new items. Offline metrics don't capture cold-start, popularity bias, or real user behavior.
- Implicit feedback ≠ explicit ratings. The absence of a click is ambiguous. Treating it as a negative example is the most common mistake in production recommender systems. Use loss functions designed for implicit feedback (WARP, not BPR or logistic loss).
- "More data" isn't always better. Pooling markets destroyed per-market signal. Each market needed its own model with its own configuration. The overhead of per-market configs is worth it.
- Cold-start is a recurring problem, not a one-time fix. New offers every 1-2 weeks means cold-start is a feature of your system, not an edge case. Your architecture must handle it natively.
- Build the measurement infrastructure before the model. If I had A/B testing from day one, I would have caught these failures in week 1 with a controlled experiment instead of week 3 when the damage was done.
- Give ops the tools to see what's happening. The Streamlit dashboard we built later — showing redemption rates, test vs control, offer coverage — would have surfaced the new-offer problem immediately.
The system around the model matters more than the model itself. A well-engineered hybrid model with proper cold-start handling and rigorous A/B testing will outperform a fancy deep learning model deployed blind.
The Architecture That Worked
After the rebuild, here's what the production system looks like:
```
Snowflake Feature Store
(User features + Offer features + Interaction history)
        ↓
Training Pipeline (Snowpark)
LightFM · WARP loss · 128-dim · per-market configs
        ↓
Model Artifact → Snowflake Internal Stage
        ↓
Prediction Pipeline
Top-3 personalized offers per user → assignment table
        ↓
Streamlit Dashboard
Upload offers · Manage flags · View A/B results
```
The entire pipeline runs natively on Snowflake — no external orchestrator, no separate model server. Adding a new country is just a new YAML config file and market-specific Snowflake infra.
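As a rough sketch of the training step's Snowflake glue (connection details and table/stage names here are illustrative placeholders):

```python
from snowflake.snowpark import Session

connection_parameters = {
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "LOYALTY", "schema": "FEATURES",
}
session = Session.builder.configs(connection_parameters).create()

# Read training data straight from the feature store tables.
interactions = session.table("OFFER_INTERACTIONS").to_pandas()

# ... fit the per-market LightFM model here ...

# Persist the artifact to an internal stage for the prediction pipeline.
session.file.put(
    "model_germany.pkl", "@MODEL_STAGE",
    auto_compress=False, overwrite=True,
)
```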
Would I Do It Differently Today?
Yes. Three things I'd change:
- Start with A/B infrastructure, not the model. Build the measurement system first. You can't improve what you can't measure.
- Deploy the simplest possible model first (popularity ranking) as a baseline. Establish the control, then iterate. My first version should have been "most popular offers" — not collaborative filtering (see the sketch after this list).
- Talk to the ops team before writing any code. They would have told me about the weekly new-offer cadence on day one. I discovered it on launch day.
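For reference, the baseline I should have shipped first fits in a few lines. A sketch, assuming a redemptions DataFrame with an `offer_id` column:

```python
import pandas as pd

def popularity_baseline(redemptions: pd.DataFrame, k: int = 3) -> list[str]:
    """Top-k offers by redemption count: the control every model must beat."""
    return redemptions["offer_id"].value_counts().head(k).index.tolist()
```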
This project gave me production scars that no textbook or course ever could. Every recommender system I've built since follows the same principle: measure first, build incrementally, and never trust your notebook.