Every data science team eventually needs to forecast something — sales, demand, workforce headcount, server load. And every team eventually discovers the same painful truth: the model that wins on Kaggle usually fails in production.
I've deployed forecasting systems at Amazon (workforce planning automation) and EPAM (multi-market demand), and most recently I led the Bayesian sales forecasting platform at RBI International across 5 European markets. This post covers what actually works, what doesn't, and how Bayesian structural modeling changed the game.
The Forecasting Landscape
There are four main families of time series forecasting models you'll encounter in production:
- Statistical models: ARIMA, ETS, Theta — mathematically elegant, well-understood, but rigid.
- Decomposition models: Prophet, STL — break time series into trend, seasonality, and residual components. Flexible and interpretable.
- Machine learning models: XGBoost, LightGBM with lag features, or deep learning approaches like N-BEATS and Temporal Fusion Transformers.
- Bayesian structural models: PyMC, Stan — encode domain knowledge as priors, decompose effects explicitly, and quantify uncertainty natively. The production powerhouse when interpretability and causal reasoning matter.
ARIMA: The Classic That Struggles
ARIMA (AutoRegressive Integrated Moving Average) is the textbook starting point. It models the time series as a function of its own past values and past forecast errors:
ARIMA(p, d, q):
p = number of autoregressive terms
d = degree of differencing (for stationarity)
q = number of moving average terms
Example: ARIMA(2, 1, 1)
y'(t) = c + φ₁y'(t-1) + φ₂y'(t-2) + θ₁ε(t-1) + ε(t)
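For reference, here's a minimal sketch of fitting that model with statsmodels — the file path and forecast horizon are placeholders:

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical univariate daily series indexed by date
sales = pd.read_csv("sales.csv", index_col="date", parse_dates=True)["y"]

# order=(p, d, q): two AR terms, one difference, one MA term
result = ARIMA(sales, order=(2, 1, 1)).fit()

# Short horizons are ARIMA's comfort zone
print(result.forecast(steps=4))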
When it works: Univariate series with stable patterns, no external regressors needed, short-term forecasts (1–4 steps ahead).
When it fails: Multiple seasonalities (daily + weekly + yearly), missing data, outliers, or when business context matters (holidays, promotions, events). Basically, most real-world scenarios.
The biggest operational pain with ARIMA is parameter selection. Auto-ARIMA helps, but it's slow and doesn't always converge. When you're forecasting 500+ time series across multiple markets, ARIMA's manual tuning becomes untenable.
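For completeness, here's what Auto-ARIMA looks like with pmdarima — a sketch, where the weekly seasonal period m=7 is an assumption for daily data and `sales` is the series from the previous snippet:

import pmdarima as pm

# Stepwise search over ARIMA orders — faster than a full grid,
# but still seconds to minutes per series, which adds up at 500+ series
model = pm.auto_arima(
    sales,
    seasonal=True,
    m=7,                      # assumed weekly seasonality on daily data
    stepwise=True,
    suppress_warnings=True,
    error_action="ignore",    # skip candidate orders that fail to converge
)
print(model.order, model.seasonal_order)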
Prophet: The 80/20 Solution
Facebook Prophet (now Meta Prophet) took a different approach. Instead of modeling autocorrelation directly, it decomposes the time series into additive (or multiplicative) components:
y(t) = g(t) + s(t) + h(t) + ε(t)
where:
g(t) = trend (piecewise linear or logistic growth)
s(t) = seasonality (Fourier series)
h(t) = holiday/event effects
ε(t) = error term
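In code, that decomposition is only a few lines. A minimal sketch — the file path is a placeholder, and ds/y are Prophet's required column names:

import pandas as pd
from prophet import Prophet

df = pd.read_csv("daily_sales.csv")  # needs columns 'ds' (date) and 'y' (value)

m = Prophet()                              # g(t) + s(t) + ε by default
m.add_country_holidays(country_name="US")  # adds the h(t) holiday component
m.fit(df)

future = m.make_future_dataframe(periods=28)  # forecast 4 weeks ahead
forecast = m.predict(future)

# Each component is a column; 'weekly'/'yearly' appear when those
# seasonalities are fitted
print(forecast[["ds", "trend", "weekly", "yearly", "yhat"]].tail())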
This decomposition is what makes Prophet production-friendly. Business stakeholders understand "trend is going up, there's a weekly pattern, and Thanksgiving causes a spike." Try explaining an ARIMA(2,1,1) to a VP.
"The best forecasting model is the one your stakeholders trust enough to act on. Interpretability isn't optional — it's a deployment requirement."
Why We Chose Prophet at Amazon
For our workforce forecasting automation at Amazon, we chose Prophet for several practical reasons:
- Handles missing data gracefully. Amazon fulfillment center data has gaps — holidays, shutdowns, system outages. Prophet just works through them.
- Multiple seasonalities. Weekly, monthly, and yearly patterns are all captured with Fourier components. This was critical for workforce planning.
- Changepoint detection. When Covid hit, demand patterns shifted overnight. Prophet's automatic changepoint detection adapted without manual intervention.
- Regressors. We added external variables like planned promotions and scheduled events as additional regressors.
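Configured, that looks roughly like this (a sketch — the regressor names are illustrative, not our actual features):

from prophet import Prophet

m = Prophet(
    changepoint_prior_scale=0.1,  # looser prior = more flexible trend shifts
    weekly_seasonality=True,
    yearly_seasonality=True,
)
m.add_seasonality(name="monthly", period=30.5, fourier_order=5)
m.add_regressor("planned_promotion")  # illustrative external regressors
m.add_regressor("scheduled_event")

# train_df needs ds, y, and one column per regressor;
# the future dataframe must carry the regressor columns forward too
m.fit(train_df)
forecast = m.predict(future_df)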
Bayesian Optimization: The Secret Sauce
The model alone wasn't enough. Prophet has ~15 tunable hyperparameters (changepoint_prior_scale, seasonality_prior_scale, fourier_order, etc.), and the defaults don't always work.
Grid search across 500+ time series was computationally infeasible. Random search was wasteful. Bayesian Optimization was the answer.
# Bayesian Optimization with Optuna
import optuna
from prophet import Prophet
from sklearn.metrics import mean_absolute_percentage_error

def objective(trial):
    params = {
        'changepoint_prior_scale': trial.suggest_float(
            'changepoint_prior_scale', 0.001, 0.5, log=True
        ),
        'seasonality_prior_scale': trial.suggest_float(
            'seasonality_prior_scale', 0.01, 10, log=True
        ),
        'holidays_prior_scale': trial.suggest_float(
            'holidays_prior_scale', 0.01, 10, log=True
        ),
        'seasonality_mode': trial.suggest_categorical(
            'seasonality_mode', ['additive', 'multiplicative']
        ),
    }
    model = Prophet(**params)
    model.fit(train_df)               # train_df: history with ds/y columns
    forecast = model.predict(val_df)  # val_df: held-out dates to score
    return mean_absolute_percentage_error(val_df['y'], forecast['yhat'])

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=50)
Bayesian Optimization builds a surrogate model (typically a Gaussian Process) that maps hyperparameters to performance. Each iteration balances exploring unknown regions and exploiting known good regions. In practice, it found optimal parameters in 30–50 trials versus 500+ for grid search.
The result: 96% reduction in manual workforce planning effort at Amazon, with forecast accuracy that stakeholders trusted enough to automate decisions on.
Multi-Market Forecasting at EPAM
At EPAM, the challenge was different: forecast demand across 5+ restaurant markets and 10 limited-time offers (LTOs) simultaneously. Each market had different seasonality patterns, different holiday calendars, and different customer behavior.
We built a Streamlit-based forecasting platform that:
- Pulled historical data from Snowflake per market/LTO combination
- Automatically tuned Prophet hyperparameters using BayesOpt per series
- Generated forecasts with uncertainty intervals
- Tracked model performance over time with MLflow (a simplified sketch of this loop follows the list)
- Let business users adjust assumptions via a visual interface
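A simplified sketch of that per-series loop — series_keys, load_series, and tune_prophet are hypothetical stand-ins for the platform's actual components:

import mlflow
from prophet import Prophet
from sklearn.metrics import mean_absolute_percentage_error

for market, lto in series_keys:                  # e.g. [("market_a", "lto_1"), ...]
    train_df, val_df = load_series(market, lto)  # hypothetical Snowflake loader

    with mlflow.start_run(run_name=f"{market}-{lto}"):
        best_params = tune_prophet(train_df, val_df)  # the Optuna loop above
        mlflow.log_params(best_params)

        model = Prophet(**best_params)
        model.fit(train_df)
        forecast = model.predict(val_df)

        mape = mean_absolute_percentage_error(val_df["y"], forecast["yhat"])
        mlflow.log_metric("val_mape", mape)      # tracked over time for drift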
The "let business users adjust" part was critical. No forecast is perfect. Giving stakeholders a way to incorporate their domain knowledge (e.g., "we know this LTO will have a bigger marketing push than last time") dramatically increased adoption. The result: 23% improvement in forecast accuracy.
Bayesian Sales Forecasting at Scale: RBI International
The most technically ambitious forecasting system I've led was the daily sales forecasting platform for a global QSR brand — a Bayesian structural model predicting daily comparable sales across multiple European markets. This wasn't just a model; it was a full production system that finance, marketing, and operations teams relied on for real-time decision-making.
Why Bayesian? Why Not Prophet or XGBoost?
The core problem demanded something Prophet couldn't deliver: explicit causal decomposition of activation (promotion) effects with cannibalization modeling. When a restaurant chain runs a promotional deal, it drives traffic to a specific menu segment — but it also cannibalizes baseline sales. We needed a model that could separately estimate both effects and let business users adjust individual activation assumptions for future forecasts.
We chose PyMC with Maximum A Posteriori (MAP) inference to build a multi-headed probabilistic model that decomposes daily log-sales into interpretable additive components:
log(sales) = α_segment # per-segment intercepts
+ β_slope · t # per-segment trends
+ β_seasonality · S(t) # Fourier seasonality
+ β_holiday · H(t) # holiday effects
+ β_act_segment · A(t) # activation → segment effects
+ β_act_baseline · B(t) # activation → baseline (cannibalization)
+ ε
Each component has a clear business interpretation. The activation-to-baseline coefficients directly measure cannibalization — how much a promotion eats into organic sales. The activation-to-segment coefficients measure the lift. This dual-headed structure is what makes the model uniquely useful: you can estimate net incrementality per promotion, not just gross lift.
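Here's a condensed PyMC sketch of that structure — synthetic data, illustrative priors and dimensions; the production model's feature construction is more involved:

import numpy as np
import pymc as pm

# Synthetic stand-ins for the real design matrices (shapes are illustrative)
rng = np.random.default_rng(0)
n_obs, n_seg, n_fourier, n_hol, n_act = 365, 4, 6, 8, 5
seg_idx = rng.integers(0, n_seg, n_obs)          # segment of each observation
t = np.arange(n_obs) / 365.0                     # time index
X_season = rng.normal(size=(n_obs, n_fourier))   # Fourier features
X_hol = rng.integers(0, 2, (n_obs, n_hol)).astype(float)      # holiday indicators
X_act_seg = rng.integers(0, 2, (n_obs, n_act)).astype(float)  # activation → segment
X_act_base = X_act_seg.copy()                                 # activation → baseline
log_sales_obs = rng.normal(10.0, 0.3, n_obs)

with pm.Model():
    alpha = pm.Normal("alpha_segment", 0.0, 1.0, shape=n_seg)
    beta_slope = pm.Normal("beta_slope", 0.0, 0.1, shape=n_seg)
    beta_season = pm.Normal("beta_seasonality", 0.0, 1.0, shape=n_fourier)
    beta_hol = pm.Normal("beta_holiday", 0.0, 0.5, shape=n_hol)
    beta_act_seg = pm.Normal("beta_act_segment", 0.0, 0.5, shape=n_act)
    beta_act_base = pm.Normal("beta_act_baseline", 0.0, 0.5, shape=n_act)
    sigma = pm.HalfNormal("sigma", 1.0)

    mu = (alpha[seg_idx]
          + beta_slope[seg_idx] * t
          + pm.math.dot(X_season, beta_season)
          + pm.math.dot(X_hol, beta_hol)
          + pm.math.dot(X_act_seg, beta_act_seg)
          + pm.math.dot(X_act_base, beta_act_base))

    pm.Normal("log_sales", mu=mu, sigma=sigma, observed=log_sales_obs)

    # MAP: fast point estimates with prior regularization (vs. full MCMC)
    coefs = pm.find_MAP()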
System Architecture
┌─────────────────────────────────────────────────────────────────┐
│                     Snowflake Data Platform                     │
│  ┌──────────────────┐  ┌───────────────────┐  ┌──────────────┐  │
│  │   Daily Sales    │  │     Promotion     │  │   Holiday    │  │
│  │                  │  │     Calendar      │  │   Calendar   │  │
│  └────────┬─────────┘  └────────┬──────────┘  └──────┬───────┘  │
│           └──────────────┬──────┘                    │          │
│                    ┌─────▼─────┐                     │          │
│                    │  Feature  │◄────────────────────┘          │
│                    │   Store   │                                │
│                    └─────┬─────┘                                │
├──────────────────────────┼──────────────────────────────────────┤
│              ┌───────────▼───────────┐                          │
│              │    PyMC MAP Model     │                          │
│              │     (per-market)      │                          │
│              │ • Segment intercepts  │                          │
│              │ • Fourier seasonality │                          │
│              │ • Holiday effects     │                          │
│              │ • Activation effects  │                          │
│              │ • Cannibalization     │                          │
│              └───────────┬───────────┘                          │
│    ┌─────────────────────┼─────────────────────┐                │
│    ▼                     ▼                     ▼                │
│  Model                Forecast             Activation           │
│ Coefficients          Results            Effect Tables          │
├─────────────────────────────────────────────────────────────────┤
│                       Streamlit Dashboard                       │
│  ┌────────────┐  ┌──────────────┐  ┌───────────────────────┐    │
│  │    Main    │  │ Activations  │  │  Model Coefficients   │    │
│  │  Dashboard │  │  Deep Dive   │  │ (Adjust & Re-forecast)│    │
│  └────────────┘  └──────────────┘  └───────────────────────┘    │
└─────────────────────────────────────────────────────────────────┘
Key Technical Decisions
- MAP instead of full MCMC. With multiple markets and dozens of activations per market, full posterior sampling was too slow for a system that needed to retrain weekly. MAP inference gave us point estimates with prior regularization — fast enough for production, principled enough for the statistics to hold.
- Separate baseline modeling. The baseline segment (organic sales with no active promotion) gets its own intercept with a scaling factor to make cannibalization effects more pronounced. This was a deliberate modeling choice — business stakeholders wanted to see worst-case cannibalization to make conservative promotion decisions.
- User-adjustable coefficients. The Streamlit dashboard lets marketing teams override activation strength and incrementality estimates for future promotions. These adjustments are versioned in Snowflake and automatically consolidated with model outputs on the next training run.
- Pandera schema validation. Every data handoff between repositories, transformations, and the dashboard is validated with strict Pandera schemas. No silent data corruption. When activation calendars changed upstream, we caught it immediately instead of producing garbage forecasts.
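A representative schema looks like this (a sketch — the column names and market codes are illustrative):

import pandera as pa

forecast_schema = pa.DataFrameSchema(
    {
        "date": pa.Column(pa.DateTime),
        "market": pa.Column(str, pa.Check.isin(["DE", "FR", "ES", "PL", "NL"])),  # illustrative codes
        "segment": pa.Column(str),
        "forecast_sales": pa.Column(float, pa.Check.ge(0)),
    },
    strict=True,   # reject unexpected columns — no silent schema drift
    coerce=True,   # cast dtypes where safe, fail loudly where not
)

validated = forecast_schema.validate(forecast_df)  # raises SchemaError on violation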
The Activation Coefficient Pipeline
The most intricate part of the system is how activation effects flow from raw model coefficients to business-facing metrics (a pandas sketch follows this list):
- The PyMC model outputs raw coefficients encoding each activation's effect type and name in a structured naming convention.
- The pipeline parses these, pivots by coefficient type (baseline effect vs segment effect), and merges with the promotion calendar to get date ranges.
- Business metrics are derived: Estimated Activation Strength = exp(segment effect), Estimated Incrementality = (1 + baseline effect) × 100.
- User adjustments are layered on top via a merge-edit-delete paradigm, stored as versioned history.
- On the next model run, the consolidated table reconciles model estimates with user overrides — model wins where data exists, user inputs preserved for future activations.
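In pandas terms, the parse-pivot-derive steps look roughly like this (the naming convention and values here are illustrative):

import numpy as np
import pandas as pd

# Raw MAP coefficients, named by effect type and activation
coef_df = pd.DataFrame({
    "coefficient": ["beta_act_segment[deal_a]", "beta_act_baseline[deal_a]"],
    "value": [0.18, -0.04],
})

# Parse effect type and activation name out of the naming convention
parsed = coef_df["coefficient"].str.extract(
    r"beta_act_(?P<effect>\w+)\[(?P<activation>\w+)\]"
)
coef_df = pd.concat([coef_df, parsed], axis=1)

# Pivot: one row per activation, one column per effect type
effects = coef_df.pivot(index="activation", columns="effect", values="value")

# Business metrics on the original sales scale
effects["activation_strength"] = np.exp(effects["segment"])      # gross lift
effects["incrementality_pct"] = (1 + effects["baseline"]) * 100  # net of cannibalization
print(effects)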
Results
The system achieved a 23% improvement in forecast accuracy over the previous approach, with comparable sales decomposition that finance teams used directly in monthly reporting. The interactive adjustment capability increased stakeholder adoption significantly — marketing teams went from ignoring model outputs to actively using them for promotion planning across all markets.
"The model that wins isn't the most accurate on backtests — it's the one that marketing, finance, and ops all trust enough to change their decisions. Bayesian decomposition gave us the interpretability to earn that trust."
When to Use What: A Decision Framework
- ARIMA/ETS: Single univariate series, clean data, short horizon, statistical rigor needed. Good for benchmarking.
- Prophet: Multiple seasonalities, missing data, holidays matter, interpretability needed, dozens to hundreds of series. The production workhorse.
- Bayesian Structural (PyMC/Stan): You need causal decomposition, explicit priors from domain knowledge, cannibalization or cross-effect modeling, and user-adjustable assumptions. Best when stakeholder trust matters as much as accuracy.
- ML Models (XGBoost/LightGBM): Rich exogenous features, cross-series learning possible, you have feature engineering expertise. Often wins on accuracy but loses on interpretability.
- Deep Learning (N-BEATS, TFT): Thousands of related series, massive data, GPU budget available, and you need to learn cross-series patterns automatically. Overkill for most business forecasting.
Common Production Pitfalls
- Overfitting on training data. A MAPE of 2% on training data means nothing if validation MAPE is 15%. Always use time-based cross-validation, never random splits.
- Ignoring uncertainty. Point forecasts without confidence intervals are dangerous. Prophet's built-in uncertainty quantification was one of its biggest selling points for us.
- Not monitoring forecast drift. Production forecasts degrade over time as patterns change. We set up automated alerts when rolling MAPE exceeded 1.5x the baseline (a minimal sketch follows this list).
- Forecasting too far ahead. Accuracy degrades rapidly as the horizon grows. A 4-week workforce forecast is useful; a 6-month one is fiction. Know your model's reliable horizon.
- Ignoring cannibalization. Promotions don't just add sales — they shift them between segments. If your model can't decompose gross lift from net incrementality, you're overstating promotion ROI.
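As a concrete example of the drift check, here's a minimal sketch — the 28-day window, the function name, and the alerting hook are assumptions, not the production implementation:

import pandas as pd

def forecast_has_drifted(history: pd.DataFrame, baseline_mape: float,
                         window: int = 28) -> bool:
    """history holds realized 'y' and the forecast 'yhat' for past dates."""
    ape = (history["y"] - history["yhat"]).abs() / history["y"].abs()
    rolling_mape = ape.rolling(window).mean().iloc[-1]
    return rolling_mape > 1.5 * baseline_mape  # the 1.5x alert threshold

# if forecast_has_drifted(history_df, baseline_mape=0.06):
#     send_alert("rolling MAPE exceeded 1.5x baseline")  # hypothetical hook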
Conclusion
The lesson from deploying forecasting systems across three companies is simple: production constraints dominate model choice. The best forecasting model isn't the one with the lowest error on a test set — it's the one that handles your data's messiness, scales to your number of series, produces interpretable results, and runs reliably every week without manual intervention.
For most business forecasting problems, Prophet with Bayesian hyperparameter tuning is the sweet spot. But when you need causal decomposition, explicit promotion effect modeling, and user-adjustable assumptions — a Bayesian structural model with PyMC is worth the extra engineering investment. It's the difference between a model that predicts and a system that stakeholders actually use to make decisions.
Up next: ML System Design — the architecture patterns that keep AI systems alive in production.