Traditional e-commerce search is broken for complex purchase intent. When a user types "I need a lightweight laptop for video editing under $1500 with good battery life," keyword search returns noise. The user has to manually decompose their intent into individual filter selections — and most give up. Agentic commerce solves this by treating product discovery as a multi-turn reasoning problem.
At EPAM, I designed a multi-stage retrieval + ranking pipeline that enables conversational product discovery across 10K+ product lines. This post covers the architecture, the trade-offs between retrieval stages, and what actually works at scale.
Why Single-Shot LLM Fails for Product Discovery
The naive approach: dump the product catalog into the LLM context and let it reason. This fails for predictable reasons:
- Context window limits: 10K+ products exceed any reasonable context window. Even with summarization, the LLM can't process the full catalog.
- Hallucinated products: Without grounding in the actual catalog, LLMs confidently recommend products that don't exist or mix attributes from different items.
- Latency: Processing thousands of product descriptions per query is too slow for conversational UX — users expect sub-second responses.
- Cost: Token costs scale linearly with catalog size. At 10K+ products, every query becomes expensive.
The Multi-Stage Pipeline Architecture
The key insight: separate fast retrieval from expensive reasoning. Each stage narrows the candidate set before the next stage does deeper analysis.
User Query
↓
[Stage 1] Embedding Retrieval (Top-100 candidates, ~50ms)
↓
[Stage 2] Intent Extraction (Parse constraints, categories, ~100ms)
↓
[Stage 3] LLM Reasoning + Re-ranking (~20-30 candidates → Top-5, ~200ms)
↓
Personalized Results
Stage 1: Embedding Retrieval
Product descriptions are pre-embedded using a fine-tuned sentence transformer. At query time, the user's natural language query is embedded and we retrieve the top-100 most similar products via approximate nearest neighbor search. This stage is fast (~50ms) and catches semantically relevant products that keyword search would miss.
# Product embedding at index time
product_embedding = model.encode(f"{title} {description} {specs}")
# Query-time retrieval
query_embedding = model.encode(user_query)
candidates = vector_store.search(query_embedding, top_k=100)
Critical decisions here: the embedding model must understand both product language and purchase intent. We fine-tuned on (query, clicked_product) pairs from historical search logs — this dramatically improved retrieval relevance vs off-the-shelf models.
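For concreteness, a common way to fine-tune on (query, clicked_product) pairs is an in-batch-negatives contrastive objective: each query's clicked product is the positive, and every other product in the batch serves as a free negative. Below is a minimal numpy sketch of that objective — illustrative only, not our production training code:

```python
import numpy as np

def in_batch_negatives_loss(query_emb, product_emb, scale=20.0):
    """Contrastive loss over a batch of (query, clicked_product) pairs.

    Row i of query_emb pairs with row i of product_emb (the clicked
    product); all other products in the batch act as negatives.
    """
    # Cosine similarity matrix: queries x products
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = product_emb / np.linalg.norm(product_emb, axis=1, keepdims=True)
    sim = scale * (q @ p.T)
    # Cross-entropy with the diagonal (the clicked product) as the label
    logits = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Training pushes each query embedding toward its clicked product and away from the rest of the batch, which is exactly the geometry Stage 1 retrieval relies on.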
Stage 2: Intent Extraction
Before the LLM reasons about products, we extract structured intent from the user's query. This is a lightweight LLM call (or fine-tuned classifier) that parses:
- Hard constraints: Price range, brand, specific features ("under $1500", "USB-C")
- Soft preferences: Use case, priority signals ("for video editing", "lightweight")
- Category signals: Product type ("laptop", "camera")
- Conversational context: References to previous turns ("something similar but cheaper")
# Intent extraction output
{
  "category": "laptop",
  "hard_constraints": {
    "price_max": 1500
  },
  "soft_preferences": ["lightweight", "video editing", "good battery life"],
  "context_refs": []
}
Hard constraints are applied as filters on the candidate set (cheap operation). Soft preferences are passed to the LLM reasoning stage as ranking signals. This reduces the candidate set from 100 → ~20-30 products before the expensive LLM call.
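In code, the constraint filter is just a cheap predicate over structured product attributes — no model involved. A sketch, with illustrative field names:

```python
def apply_hard_constraints(candidates, constraints):
    """Filter retrieval candidates on structured attributes (cheap, exact)."""
    filtered = []
    for product in candidates:
        if "price_max" in constraints and product["price"] > constraints["price_max"]:
            continue
        if "category" in constraints and product["category"] != constraints["category"]:
            continue
        filtered.append(product)
    return filtered
```

Because these checks run against catalog fields, not embeddings, a product that violates a hard constraint can never survive into the LLM stage, no matter how semantically similar it is.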
Stage 3: LLM Reasoning + Re-ranking
The surviving candidates (20-30 products) are presented to the LLM along with the extracted intent and conversation history. The LLM performs:
- Relevance scoring: How well does each product match the user's complete intent?
- Trade-off reasoning: "This laptop has better battery but is slightly heavier — mentioning the trade-off helps the user decide"
- Explanation generation: Why each recommendation was selected, in natural language
- Re-ranking: Final ordering based on holistic relevance, not just embedding similarity
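A sketch of how the surviving candidates and extracted intent get assembled for the re-ranking call — the exact prompt wording and response schema here are illustrative, not our production prompt:

```python
import json

def build_rerank_prompt(intent, candidates):
    """Assemble the re-ranking prompt: extracted intent + grounded candidates.

    The LLM only ever sees catalog-verified products, so it can score,
    explain, and reorder — but not invent items.
    """
    lines = [
        "Rank these products for the user's intent and explain each pick.",
        f"Intent: {json.dumps(intent)}",
        "Candidates:",
    ]
    for i, p in enumerate(candidates, 1):
        lines.append(f"{i}. {p['title']} | ${p['price']} | {p['specs']}")
    lines.append('Reply as JSON: [{"rank": 1, "candidate": <number>, "reason": "..."}]')
    return "\n".join(lines)
```

Asking for candidate numbers rather than free-text product names keeps the response parseable and anchored to the grounded list.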
"The re-ranking stage adds ~200ms of latency but lifts precision@5 by 35%. Without it, embedding retrieval returns semantically similar but not purchase-relevant results — a hiking shoe review is semantically close to hiking shoes, but it's not what the user wants to buy."
Conversational Context Management
Multi-turn conversations are where agentic commerce differentiates from traditional search. The system must maintain context across turns:
Turn 1: "I need a laptop for video editing"
→ Shows top-5 laptops with GPU focus
Turn 2: "Something cheaper"
→ Understands "cheaper" is relative to previous results
→ Filters by lower price range, maintains GPU preference
Turn 3: "How about this one but with more RAM?"
→ References a specific product from Turn 2
→ Retrieves similar products with more RAM
We maintain a sliding context window of the last 5 turns plus a summary of the user's evolving preferences. The intent extraction stage merges new signals with existing context — if the user said "lightweight" three turns ago and hasn't contradicted it, that preference persists.
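The carryover rule itself is simple. A sketch of the merge step, using the same intent structure as Stage 2 (field names illustrative): new hard constraints override old ones on the same key, while soft preferences accumulate until contradicted.

```python
def merge_intent(context, new_intent):
    """Merge a new turn's extracted intent into the running preference state."""
    return {
        # Keep the known category unless this turn names a new one
        "category": new_intent.get("category") or context.get("category"),
        # New hard constraints override old ones on the same key
        "hard_constraints": {
            **context.get("hard_constraints", {}),
            **new_intent.get("hard_constraints", {}),
        },
        # Soft preferences accumulate, deduplicated in first-seen order
        "soft_preferences": list(dict.fromkeys(
            context.get("soft_preferences", [])
            + new_intent.get("soft_preferences", [])
        )),
    }
```

So "something cheaper" in Turn 2 arrives as a lower `price_max` that replaces the old one, while "lightweight" from Turn 1 persists untouched.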
Handling 10K+ Product Lines at Scale
Scale introduces specific challenges:
- Index freshness: Products change daily — prices, availability, new arrivals. We use incremental embedding updates (only re-embed changed products) rather than full re-indexing.
- Category diversity: A single embedding space for electronics, clothing, and food products leads to poor retrieval. We use category-aware embeddings with a shared base model + category-specific adapters.
- Cold-start products: New products with no interaction data get initial embeddings from their description + specs. As user interactions accumulate, embeddings are enriched with behavioral signals.
- Latency budget: Total pipeline must stay under 500ms for conversational UX. Each stage has a strict time budget: retrieval (50ms), intent (100ms), LLM reasoning (300ms).
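On the index-freshness point: a content hash over the fields that feed the embedding is one simple way to identify which products actually need re-embedding. A sketch (structured fields like price live outside the embedding, so a price change alone triggers nothing):

```python
import hashlib

def products_to_reembed(catalog, stored_hashes):
    """Return ids of products whose embedded text changed since the last run.

    Hashing the embedding input avoids re-embedding the full 10K+ catalog
    when only a handful of products changed. Mutates stored_hashes in place.
    """
    changed = []
    for product in catalog:
        text = f"{product['title']} {product['description']} {product['specs']}"
        digest = hashlib.sha256(text.encode()).hexdigest()
        if stored_hashes.get(product["id"]) != digest:
            changed.append(product["id"])
            stored_hashes[product["id"]] = digest
    return changed
```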
Evaluation: What We Measured
Traditional IR metrics (precision, recall, NDCG) are necessary but not sufficient. For conversational commerce, we also measure:
- Precision@5: What fraction of the top-5 results is relevant? Our multi-stage pipeline achieves 0.78 vs 0.52 for keyword search.
- Intent satisfaction rate: Does the user find what they want within 3 turns? Measured via click-through and cart-add rates.
- Conversation depth: Average turns per session. Higher is good (engagement) up to a point — beyond 5 turns, the system is failing to understand intent.
- Fallback rate: How often does the system say "I can't find a match"? Too high means poor retrieval; too low means it's hallucinating matches.
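Precision@k is the one metric above with a one-line definition; for completeness:

```python
def precision_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the top-k results judged relevant for the query."""
    top = ranked_ids[:k]
    return sum(1 for pid in top if pid in relevant_ids) / k
```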
Lessons from Production
- Start with retrieval quality. If Stage 1 misses the right products, no amount of LLM reasoning can fix it. Invest heavily in embedding quality and retrieval coverage.
- Intent extraction is the hardest stage. Users express purchase intent in wildly different ways. "I need something for my mom" requires reasoning about demographics, occasion, and price sensitivity — all implicit.
- Re-ranking is worth the latency. The 200ms cost of LLM re-ranking is repaid many times over in conversion improvement. Embedding similarity alone is not enough for purchase relevance.
- Conversation context has diminishing returns. Beyond 5 turns of history, the LLM starts getting confused by contradictory signals. Summarize aggressively.
- Monitor hallucinated products. Even with grounding, LLMs occasionally describe features that the actual product doesn't have. Automated validation against the product database catches these before they reach users.
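The validation step for hallucinated features can be as simple as checking each attribute claim the LLM makes against the catalog record. A sketch with illustrative field names — production validation would also normalize units and handle fuzzy attribute names:

```python
def validate_claims(response_claims, product_record):
    """Check LLM-asserted (attribute, value) claims against the catalog.

    Returns (attribute, claimed, actual) tuples for any claim about an
    attribute the product lacks or whose value doesn't match.
    """
    violations = []
    for attr, value in response_claims.items():
        actual = product_record.get(attr)
        if actual is None or str(actual).lower() != str(value).lower():
            violations.append((attr, value, actual))
    return violations
```

Any non-empty result blocks or rewrites the response before it reaches the user, which is cheap insurance against an LLM confidently describing 32GB of RAM on a 16GB machine.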
"Agentic commerce isn't about replacing search — it's about augmenting it with reasoning. The best results come from combining fast, reliable retrieval with slow, thoughtful reasoning. The multi-stage pipeline makes this economically viable at scale."