Traditional e-commerce search is broken for complex purchase intent. When a user types "I need a lightweight laptop for video editing under $1500 with good battery life," keyword search returns noise. The user has to manually decompose their intent into individual filter selections — and most give up. Agentic commerce solves this by treating product discovery as a multi-turn reasoning problem.

At EPAM, I designed a multi-stage retrieval + ranking pipeline that enables conversational product discovery across 10K+ product lines. This post covers the architecture, the trade-offs between retrieval stages, and what actually works at scale.

Why Single-Shot LLM Fails for Product Discovery

The naive approach: dump the product catalog into the LLM context and let it reason. This fails for predictable reasons: a catalog of 10K+ product lines doesn't fit in any context window, long prompts make every query slow and expensive, the LLM can't reliably enforce hard constraints like a $1500 price cap, and without grounding it will hallucinate products and specs.

The Multi-Stage Pipeline Architecture

The key insight: separate fast retrieval from expensive reasoning. Each stage narrows the candidate set before the next stage does deeper analysis.

User Query
    ↓
[Stage 1] Embedding Retrieval (Top-100 candidates, ~50ms)
    ↓
[Stage 2] Intent Extraction (Parse constraints, categories, ~100ms)
    ↓
[Stage 3] LLM Reasoning + Re-ranking (Top-10 → Top-5, ~200ms)
    ↓
Personalized Results
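The flow above can be sketched as a single orchestration function. The stage functions here are hypothetical stand-ins (stubbed with toy data), not the production implementations; any retrieval backend, intent parser, and re-ranker with these shapes will fit.

```python
# Minimal sketch of the three-stage pipeline. Stage functions are
# hypothetical stubs standing in for the real implementations.

def embed_retrieve(query, top_k=100):
    # Stage 1: ANN search over pre-embedded products (stubbed here).
    return [{"id": i, "price": 500 + 10 * i} for i in range(top_k)]

def extract_intent(query):
    # Stage 2: parse hard constraints and soft preferences (stubbed here).
    return {"hard_constraints": {"price_max": 900},
            "soft_preferences": ["battery life"]}

def llm_rerank(candidates, intent, top_n=5):
    # Stage 3: expensive reasoning over a small candidate set (a price
    # sort stands in for the LLM call in this sketch).
    return sorted(candidates, key=lambda p: p["price"])[:top_n]

def discover(query):
    candidates = embed_retrieve(query)                      # ~50ms
    intent = extract_intent(query)                          # ~100ms
    price_max = intent["hard_constraints"].get("price_max")
    if price_max is not None:
        candidates = [p for p in candidates if p["price"] <= price_max]
    return llm_rerank(candidates, intent)                   # ~200ms

results = discover("lightweight laptop under $900")
```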

Stage 1: Embedding Retrieval

Product descriptions are pre-embedded using a fine-tuned sentence transformer. At query time, the user's natural language query is embedded and we retrieve the top-100 most similar products via approximate nearest neighbor search. This stage is fast (~50ms) and catches semantically relevant products that keyword search would miss.

# Product embedding at index time
product_embedding = model.encode(f"{title} {description} {specs}")

# Query-time retrieval
query_embedding = model.encode(user_query)
candidates = vector_store.search(query_embedding, top_k=100)

Critical decisions here: the embedding model must understand both product language and purchase intent. We fine-tuned on (query, clicked_product) pairs from historical search logs — this dramatically improved retrieval relevance vs off-the-shelf models.
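A sketch of how those (query, clicked_product) training pairs can be assembled from search logs. The log schema and product fields here are illustrative assumptions; in practice the pairs feed a contrastive objective such as sentence-transformers' MultipleNegativesRankingLoss.

```python
# Build (query, clicked_product_text) pairs from search logs for
# contrastive fine-tuning. Log schema and product fields are
# hypothetical; the real pipeline reads from the analytics store.

search_logs = [
    {"query": "laptop for video editing", "clicked_product_id": "p1"},
    {"query": "quiet mechanical keyboard", "clicked_product_id": "p2"},
    {"query": "laptop for video editing", "clicked_product_id": None},  # no click
]

products = {
    "p1": {"title": "CreatorBook 15", "description": "RTX GPU, 32GB RAM"},
    "p2": {"title": "SilentType 87", "description": "Dampened brown switches"},
}

def build_training_pairs(logs, catalog):
    pairs = []
    for row in logs:
        pid = row["clicked_product_id"]
        if pid is None:  # skip sessions without a click signal
            continue
        product = catalog[pid]
        pairs.append((row["query"],
                      f"{product['title']} {product['description']}"))
    return pairs

pairs = build_training_pairs(search_logs, products)
```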

Stage 2: Intent Extraction

Before the LLM reasons about products, we extract structured intent from the user's query. This is a lightweight LLM call (or fine-tuned classifier) that parses:

# Intent extraction output
{
    "category": "laptop",
    "hard_constraints": {
        "price_max": 1500,
        "weight": "lightweight"
    },
    "soft_preferences": ["video editing", "good battery life"],
    "context_refs": []
}

Hard constraints are applied as filters on the candidate set (cheap operation). Soft preferences are passed to the LLM reasoning stage as ranking signals. This reduces the candidate set from 100 → ~20-30 products before the expensive LLM call.
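A minimal sketch of that constraint-filtering step, assuming candidates carry structured fields matching the extracted constraint names. The field names and the 2.0 kg threshold for "lightweight" are illustrative assumptions.

```python
# Apply extracted hard constraints as cheap structured filters before
# the LLM stage. Candidate fields and thresholds are illustrative.

def apply_hard_constraints(candidates, constraints):
    def passes(product):
        if "price_max" in constraints and product["price"] > constraints["price_max"]:
            return False
        # "lightweight" maps to a concrete spec threshold (assumed: < 2.0 kg)
        if constraints.get("weight") == "lightweight" and product["weight_kg"] >= 2.0:
            return False
        return True
    return [p for p in candidates if passes(p)]

candidates = [
    {"id": "a", "price": 1399, "weight_kg": 1.6},
    {"id": "b", "price": 1899, "weight_kg": 1.4},  # fails price_max
    {"id": "c", "price": 1200, "weight_kg": 2.4},  # fails weight
]
filtered = apply_hard_constraints(
    candidates, {"price_max": 1500, "weight": "lightweight"}
)
```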

Stage 3: LLM Reasoning + Re-ranking

The surviving candidates (20-30 products) are presented to the LLM along with the extracted intent and conversation history. The LLM performs trade-off reasoning over the soft preferences (weighing, say, GPU power against battery life), checks each candidate against requirements that structured filters can't express, narrows the set to a top-10 shortlist and then a final top-5, and generates a grounded explanation of why each result fits.

"The re-ranking stage adds ~200ms of latency but lifts precision@5 by 35%. Without it, embedding retrieval returns semantically similar but not purchase-relevant results — a hiking shoe review is semantically close to hiking shoes, but it's not what the user wants to buy."
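The re-ranking call itself is just a prompt over the small surviving set. A sketch of the prompt assembly follows; the template wording and the JSON response contract are assumptions, not the production prompt.

```python
import json

# Assemble the Stage 3 re-ranking prompt. Template wording and the
# JSON response contract are assumptions, not the production prompt.

def build_rerank_prompt(candidates, intent, history, top_n=5):
    lines = [
        "You are ranking products for a shopper.",
        f"Hard constraints (already enforced): {json.dumps(intent['hard_constraints'])}",
        f"Soft preferences: {json.dumps(intent['soft_preferences'])}",
        f"Conversation so far: {json.dumps(history)}",
        "Candidates:",
    ]
    for p in candidates:
        lines.append(f"- [{p['id']}] {p['title']}: {p['summary']}")
    lines.append(
        f"Return the {top_n} best ids as a JSON list, best first, "
        "using only ids listed above."
    )
    return "\n".join(lines)

prompt = build_rerank_prompt(
    candidates=[{"id": "p1", "title": "CreatorBook 15",
                 "summary": "RTX GPU, 1.6kg"}],
    intent={"hard_constraints": {"price_max": 1500},
            "soft_preferences": ["video editing", "good battery life"]},
    history=["I need a lightweight laptop for video editing under $1500"],
)
```

Restricting the response to ids from the candidate list is what keeps this stage grounded: the LLM can rank and explain, but it cannot introduce products that retrieval never surfaced.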

Conversational Context Management

Multi-turn conversations are where agentic commerce differentiates from traditional search. The system must maintain context across turns:

Turn 1: "I need a laptop for video editing"
→ Shows top-5 laptops with GPU focus

Turn 2: "Something cheaper"
→ Understands "cheaper" is relative to previous results
→ Filters by lower price range, maintains GPU preference

Turn 3: "How about this one but with more RAM?"
→ References a specific product from Turn 2
→ Retrieves similar products with more RAM

We maintain a sliding context window of the last 5 turns plus a summary of the user's evolving preferences. The intent extraction stage merges new signals with existing context — if the user said "lightweight" three turns ago and hasn't contradicted it, that preference persists.
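A sketch of that merge step, assuming intents use the Stage 2 structure: newer hard constraints override older values for the same key, soft preferences accumulate until contradicted, and only the last 5 turns are kept verbatim.

```python
MAX_TURNS = 5  # sliding window size

def merge_context(state, new_intent, user_turn):
    # Newer hard constraints override older values for the same key.
    state["hard_constraints"].update(new_intent.get("hard_constraints", {}))
    # Soft preferences persist until contradicted; dedupe, keep order.
    for pref in new_intent.get("soft_preferences", []):
        if pref not in state["soft_preferences"]:
            state["soft_preferences"].append(pref)
    # Keep only the last MAX_TURNS turns verbatim; older turns would
    # be folded into a summary in the full system.
    state["turns"] = (state["turns"] + [user_turn])[-MAX_TURNS:]
    return state

state = {"hard_constraints": {}, "soft_preferences": [], "turns": []}
state = merge_context(state,
                      {"hard_constraints": {},
                       "soft_preferences": ["video editing"]},
                      "I need a laptop for video editing")
state = merge_context(state,
                      {"hard_constraints": {"price_max": 1000},
                       "soft_preferences": []},
                      "Something cheaper")
```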

Handling 10K+ Product Lines at Scale

Scale introduces specific challenges: the ANN index must stay inside the ~50ms retrieval budget as the catalog grows, embeddings must be refreshed incrementally as products change rather than through full re-indexing, and intent extraction must map queries onto a taxonomy spanning 10K+ product lines whose attribute vocabularies differ wildly ("lightweight" means something different for a laptop than for a tent).

Evaluation: What We Measured

Traditional IR metrics (precision, recall, NDCG) are necessary but not sufficient. For conversational commerce, we also measure:

  1. Precision@5: Are the top-5 results all relevant? Our multi-stage pipeline achieves 0.78 vs 0.52 for keyword search.
  2. Intent satisfaction rate: Does the user find what they want within 3 turns? Measured via click-through and cart-add rates.
  3. Conversation depth: Average turns per session. Higher is good (engagement) up to a point — beyond 5 turns, the system is failing to understand intent.
  4. Fallback rate: How often does the system say "I can't find a match"? Too high means poor retrieval; too low means it's hallucinating matches.
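The first and fourth metrics reduce to simple counting over labeled sessions. A sketch, with an illustrative session format:

```python
# Compute precision@5 and fallback rate over labeled sessions.
# Session format is illustrative: returned product ids, the set judged
# relevant, and whether the system fell back with "no match".

def precision_at_k(returned_ids, relevant_ids, k=5):
    top = returned_ids[:k]
    if not top:
        return 0.0
    return sum(1 for pid in top if pid in relevant_ids) / len(top)

def fallback_rate(sessions):
    return sum(1 for s in sessions if s["fell_back"]) / len(sessions)

sessions = [
    {"returned": ["a", "b", "c", "d", "e"],
     "relevant": {"a", "b", "c", "d"}, "fell_back": False},
    {"returned": [], "relevant": set(), "fell_back": True},
]
p5 = precision_at_k(sessions[0]["returned"], sessions[0]["relevant"])
rate = fallback_rate(sessions)
```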

Lessons from Production

  1. Start with retrieval quality. If Stage 1 misses the right products, no amount of LLM reasoning can fix it. Invest heavily in embedding quality and retrieval coverage.
  2. Intent extraction is the hardest stage. Users express purchase intent in wildly different ways. "I need something for my mom" requires reasoning about demographics, occasion, and price sensitivity — all implicit.
  3. Re-ranking is worth the latency. The 200ms cost of LLM re-ranking is repaid many times over in conversion improvement. Embedding similarity alone is not enough for purchase relevance.
  4. Conversation context has diminishing returns. Beyond 5 turns of history, the LLM starts getting confused by contradictory signals. Summarize aggressively.
  5. Monitor hallucinated products. Even with grounding, LLMs occasionally describe features that the actual product doesn't have. Automated validation against the product database catches these before they reach users.
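The validation in lesson 5 can be as simple as checking every claimed attribute against the catalog record before the response ships. A sketch; the claim and record formats are assumptions:

```python
# Validate LLM-claimed specs against the product database before the
# response reaches the user. Claim/record formats are assumptions.

def find_hallucinated_claims(claims, product_record):
    violations = []
    for field, claimed in claims.items():
        actual = product_record.get(field)
        if actual != claimed:
            violations.append((field, claimed, actual))
    return violations

record = {"ram_gb": 16, "has_dedicated_gpu": True}
claims = {"ram_gb": 32, "has_dedicated_gpu": True}  # LLM over-claimed RAM
violations = find_hallucinated_claims(claims, record)
```

Any non-empty violation list blocks the generated description and falls back to catalog text for the offending fields.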

"Agentic commerce isn't about replacing search — it's about augmenting it with reasoning. The best results come from combining fast, reliable retrieval with slow, thoughtful reasoning. The multi-stage pipeline makes this economically viable at scale."