Large Language Models have fundamentally changed how we build AI systems. But beneath the hype lies elegant engineering — attention mechanisms, positional encodings, KV-caches, and inference optimization that make these models work at scale. Most practitioners use LLMs as black boxes. The engineers who understand the internals build dramatically better systems.

In this post, I'll break down every core architecture component that powers models like GPT-4, LLaMA 3, Claude, and Mistral — and share practical insights from deploying LLMs in production at EPAM, where I built RAG-powered policy generators and agentic AI workflows.

The Transformer: Why It Won

The transformer architecture, introduced in the landmark "Attention is All You Need" (Vaswani et al., 2017), replaced recurrence with self-attention. Before transformers, RNNs and LSTMs processed sequences one token at a time — a fundamental bottleneck for parallelism. Transformers process all tokens simultaneously.

"The key insight of transformers is that you don't need to process sequences sequentially. Every token can attend to every other token simultaneously. This single change unlocked training on billions of parameters."

A transformer block consists of two sub-layers: multi-head self-attention and a position-wise feed-forward network. Each sub-layer has a residual connection and layer normalization. Stack 32–128 of these blocks, and you get a modern LLM.

Transformer Block:
  Input → LayerNorm → Multi-Head Attention → + (residual)
        → LayerNorm → Feed-Forward Network  → + (residual)
        → Output

Modern LLMs (LLaMA, GPT):
  - Pre-norm (LayerNorm before attention, not after)
  - RMSNorm instead of LayerNorm (faster, equally effective)
  - SwiGLU activation in FFN (replaces ReLU)
  - Grouped Query Attention (reduces KV-cache memory)
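Wired up in code, the pre-norm block is only a few lines. A minimal NumPy sketch, with toy sub-layers standing in for real attention and FFN:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # RMSNorm: scale by root-mean-square, no mean subtraction (cheaper than LayerNorm)
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def transformer_block(x, attention, ffn):
    # Pre-norm wiring used by LLaMA/GPT-style models:
    # normalize *before* each sub-layer, add the residual after.
    x = x + attention(rms_norm(x))
    x = x + ffn(rms_norm(x))
    return x

# Toy sub-layers just to show the data flow; real models plug in
# multi-head attention and a SwiGLU FFN here.
x = np.ones((4, 8))                      # (seq_len, d_model)
out = transformer_block(x, lambda h: h * 0.1, lambda h: h * 0.1)
print(out.shape)                         # (4, 8)
```

The pre-norm placement matters in practice: normalizing the input to each sub-layer (rather than its output) keeps gradients stable enough to train very deep stacks without careful warmup.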

Tokenization: Where It All Starts

Before a single attention computation happens, text must be converted into tokens. This step is deceptively important — tokenization quality directly affects model performance, cost, and latency.

Modern LLMs use subword tokenization: Byte-Pair Encoding (BPE) in the GPT family, SentencePiece/BPE in LLaMA and Mistral, and WordPiece in BERT-era models. Subword schemes split rare words into frequent fragments while keeping common words as single tokens, balancing vocabulary size against sequence length.

Why this matters in production: token count = cost. A poorly tokenized prompt with 2000 tokens might only need 1200 with a better tokenizer. At scale, that's a 40% cost reduction. When we built the RAG policy generator at EPAM, optimizing prompt tokenization saved meaningful API costs.
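To see where subword vocabularies come from, here is a toy byte-pair-encoding trainer in pure Python. Real tokenizers (tiktoken, SentencePiece) apply the same merge idea at byte level with learned vocabularies of 30K-200K entries; this sketch just shows the core loop:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    # Toy BPE trainer: repeatedly merge the most frequent adjacent symbol pair.
    vocab = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in vocab:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)       # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        for i, w in enumerate(vocab):          # apply the merge everywhere
            out, j = [], 0
            while j < len(w):
                if j + 1 < len(w) and (w[j], w[j + 1]) == best:
                    out.append(merged)
                    j += 2
                else:
                    out.append(w[j])
                    j += 1
            vocab[i] = out
    return merges, vocab

merges, vocab = bpe_merges(["lower", "lowest", "low"], num_merges=2)
print(merges)   # [('l', 'o'), ('lo', 'w')]
```

After two merges, all three words share the "low" prefix as a single unit, which is exactly how frequent stems end up as single tokens in production vocabularies.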

Multi-Head Attention: The Core Mechanism

Self-attention is what gives transformers their power. For each token, the model computes how much it should "attend to" every other token in the sequence. The math:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

where:
  Q = query matrix  (what am I looking for?)
  K = key matrix    (what do I contain?)
  V = value matrix  (what information do I provide?)
  d_k = key dimension (the √d_k scaling keeps dot products small so the softmax doesn't saturate)
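The formula translates directly to NumPy:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (seq_q, seq_k) similarity matrix
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4)), rng.normal(size=(5, 4)), rng.normal(size=(5, 4))
out, w = attention(Q, K, V)
print(out.shape)                              # (3, 4): one mixed value per query
```

Each output row is a weighted average of the value vectors, with weights given by how well that query matches each key.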

Multi-head attention runs this computation multiple times in parallel with different learned projections. Each "head" learns to attend to different types of relationships:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) · W_O

where head_i = Attention(Q·W_Q_i, K·W_K_i, V·W_V_i)

GPT-4 (estimated):  96 heads, d_model = 12288
LLaMA 3 70B:        64 heads, d_model = 8192
Mistral 7B:         32 heads, d_model = 4096

Grouped Query Attention (GQA)

Standard multi-head attention uses separate K and V projections per head. This means the KV-cache grows linearly with the number of heads — a major memory bottleneck during inference.

Grouped Query Attention (used by LLaMA 2/3, Mistral) shares K and V projections across groups of query heads. LLaMA 3 70B uses 64 query heads but only 8 KV heads — an 8x reduction in KV-cache memory with minimal quality loss.
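A minimal NumPy sketch of the grouping trick: the small set of KV heads is simply repeated to match the query heads at compute time, so the cache only ever stores the small set.

```python
import numpy as np

def grouped_query_attention(Q, K, V, num_q_heads, num_kv_heads):
    # Q: (num_q_heads, seq, d_head); K, V: (num_kv_heads, seq, d_head).
    # Each group of num_q_heads // num_kv_heads query heads shares one KV head,
    # so the KV-cache holds num_kv_heads entries instead of num_q_heads.
    group = num_q_heads // num_kv_heads
    K = np.repeat(K, group, axis=0)           # expand KV heads to query heads
    V = np.repeat(V, group, axis=0)
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(1)
Q = rng.normal(size=(8, 6, 16))               # 8 query heads
K = rng.normal(size=(2, 6, 16))               # only 2 KV heads (4 queries per group)
V = rng.normal(size=(2, 6, 16))
out = grouped_query_attention(Q, K, V, 8, 2)
print(out.shape)                              # (8, 6, 16)
```

With num_kv_heads = num_q_heads this reduces to standard MHA; with num_kv_heads = 1 it is Multi-Query Attention (MQA). GQA sits between the two.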

Positional Encoding: Teaching Order

Self-attention is permutation-invariant — it doesn't know token order. The model can't distinguish "dog bites man" from "man bites dog." Positional encodings solve this.

Absolute Positional Encoding (Original Transformer)

The original paper used sinusoidal functions of different frequencies added to token embeddings. Simple, but limited — the model can't generalize to sequences longer than it was trained on.

Rotary Position Embedding (RoPE)

Modern LLMs use RoPE (Rotary Position Embedding), which encodes position by rotating the query and key vectors in pairs of dimensions. The key property: the attention score between two tokens depends only on their relative distance, not absolute position.

RoPE: rotate Q and K by position-dependent angles

Benefits:
  - Relative position awareness (distance-based attention decay)
  - Extrapolates to longer sequences than training length
  - Compatible with linear attention approximations
  
Used by: LLaMA, Mistral, Qwen, PaLM, Gemma

RoPE's extrapolation ability is why techniques like YaRN and NTK-aware scaling can extend context windows to 128K+ tokens without retraining from scratch — just by adjusting the rotation frequencies.
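The relative-position property is easy to verify numerically. A minimal RoPE sketch, one frequency per dimension pair as in the formulation above:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Rotate pairs of dimensions of x by position-dependent angles.
    # x: vector of even dimension d; pos: integer token position.
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per dim pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin             # 2-D rotation applied per pair
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(2)
q, k = rng.normal(size=8), rng.normal(size=8)
# Key property: the q·k score depends only on relative distance, not position.
s1 = rope(q, 5) @ rope(k, 3)       # positions 5 and 3 -> distance 2
s2 = rope(q, 105) @ rope(k, 103)   # positions 105 and 103 -> same distance
print(np.isclose(s1, s2))          # True
```

Context-extension tricks like NTK-aware scaling work by adjusting `base` (and therefore the rotation frequencies) so that long distances map onto angle ranges the model saw during training.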

The Feed-Forward Network: Where Knowledge Lives

After attention, each token passes through a position-wise feed-forward network (FFN). This is where a large portion of the model's "knowledge" is stored — factual information, world knowledge, and learned patterns are encoded in the FFN weight matrices.

Original: FFN(x) = ReLU(xW₁ + b₁)W₂ + b₂

Modern (SwiGLU, used in LLaMA/Mistral):
  FFN(x) = (Swish(xW₁) ⊙ xW₃)W₂

  - ⊙ = element-wise multiplication (gating)
  - Swish(x) = x · sigmoid(x)
  - W₃ adds a gating mechanism for better gradient flow

The FFN hidden dimension is typically around 4x the model dimension (SwiGLU variants use somewhat less, since the gate adds a third weight matrix). For LLaMA 3 70B with d_model=8192, the FFN hidden dimension is 28672 — each FFN layer's three matrices hold roughly 700 million parameters. This is where most of the parameters (and compute) live.
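A sketch of the SwiGLU FFN exactly as defined above, with toy dimensions and random weights:

```python
import numpy as np

def swiglu_ffn(x, W1, W2, W3):
    # FFN(x) = (Swish(x W1) ⊙ (x W3)) W2 — the gated variant used by LLaMA/Mistral.
    def swish(z):
        return z / (1.0 + np.exp(-z))       # Swish(z) = z * sigmoid(z)
    return (swish(x @ W1) * (x @ W3)) @ W2   # ⊙ is element-wise gating

d_model, d_hidden = 8, 22                    # real models use d_hidden ≈ 3.5x d_model
rng = np.random.default_rng(3)
W1 = rng.normal(size=(d_model, d_hidden))    # up-projection
W3 = rng.normal(size=(d_model, d_hidden))    # gate projection
W2 = rng.normal(size=(d_hidden, d_model))    # down-projection
x = rng.normal(size=(4, d_model))
out = swiglu_ffn(x, W1, W2, W3)
print(out.shape)                             # (4, 8)
```

The gate path (W3) lets the network learn to suppress or pass each hidden unit multiplicatively, which in practice trains better than a plain ReLU FFN at the same parameter count.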

Scaling Laws: When to Use What Size

Kaplan et al. (2020) and the Chinchilla paper (Hoffmann et al., 2022) established that LLM performance follows predictable power laws based on three variables: parameters, data, and compute. Chinchilla's headline result: for a fixed compute budget, parameters and training tokens should scale together, roughly 20 training tokens per parameter, which meant most earlier large models were undertrained for their size.

"At EPAM, our RAG-powered IAM policy generator uses GPT-4 for complex multi-cloud policies but routes simple single-service policies to a fine-tuned smaller model. This model routing saved 60% on API costs while maintaining quality."

KV-Cache: The Inference Bottleneck

During autoregressive generation, the model generates one token at a time. For each new token, it needs to compute attention over all previous tokens. Without caching, every step would re-run the key and value projections (and attention) over the entire prefix, so each step costs O(n²) and generating n tokens costs O(n³).

The KV-cache stores previously computed key and value tensors so they don't need to be recomputed. This reduces per-step computation to O(n) but introduces a memory problem: the cache grows with sequence length × batch size × number of layers × number of heads.

KV-Cache Memory per request:
  = 2 × num_layers × num_kv_heads × d_head × seq_len × bytes_per_param

Example (LLaMA 3 70B, 4K context, FP16):
  = 2 × 80 × 8 × 128 × 4096 × 2 bytes
  = ~1.3 GB per request

With GQA (8 KV heads instead of 64):
  = ~1.3 GB (vs ~10.5 GB with standard MHA)

This is why GQA matters — it's the difference between serving 8 concurrent requests vs 1 on a single GPU.
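The memory formula above as a small calculator, reproducing the LLaMA 3 70B numbers:

```python
def kv_cache_bytes(num_layers, num_kv_heads, d_head, seq_len, bytes_per_param=2):
    # Factor of 2 covers keys AND values; one entry per layer, KV head, position.
    return 2 * num_layers * num_kv_heads * d_head * seq_len * bytes_per_param

# LLaMA 3 70B, 4K context, FP16:
gqa = kv_cache_bytes(num_layers=80, num_kv_heads=8, d_head=128, seq_len=4096)
mha = kv_cache_bytes(num_layers=80, num_kv_heads=64, d_head=128, seq_len=4096)
print(round(gqa / 2**30, 2), "GiB with GQA")   # 1.25 GiB with GQA
print(round(mha / 2**30, 2), "GiB with MHA")   # 10.0 GiB with standard MHA
```

Note the cache scales linearly with both sequence length and batch size, so doubling context length doubles per-request cache memory.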

KV-Cache Optimization Techniques

Beyond GQA, production stacks lean on PagedAttention (vLLM allocates the cache in fixed-size blocks, nearly eliminating fragmentation), sliding-window attention (Mistral bounds the cache to a fixed window of recent tokens), and KV-cache quantization (storing cached keys and values in INT8 or FP8).

Quantization: Shrinking Models for Production

A 70B parameter model in FP16 requires 140 GB of GPU memory — that's two A100 80GB GPUs just to load the weights. Quantization reduces precision to make models fit on fewer (or cheaper) GPUs.

Memory requirements for LLaMA 3 70B:
  FP32:  280 GB  (impractical)
  FP16:  140 GB  (2× A100 80GB)
  INT8:   70 GB  (1× A100 80GB)
  INT4:   35 GB  (1× A100 40GB or 2× RTX 4090)

Production recommendation:
  - Serving: INT8 for quality-sensitive tasks, INT4 for throughput
  - Fine-tuning: Always in FP16/BF16 (quantization hurts gradients)
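The memory table above comes from simple arithmetic: weights-only memory at each precision, ignoring the KV-cache and activations that sit on top.

```python
def weight_memory_gb(num_params_billion, bits_per_param):
    # Weights only; KV-cache, activations, and framework overhead add more.
    return num_params_billion * 1e9 * bits_per_param / 8 / 1e9

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(70, bits):.0f} GB")
# FP32: 280 GB / FP16: 140 GB / INT8: 70 GB / INT4: 35 GB
```

In practice, budget headroom beyond the weights figure: a 35 GB INT4 model on a 40 GB GPU leaves only ~5 GB for KV-cache, which limits batch size and context length.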

Serving Infrastructure: Getting to Production

The model is maybe 30% of the production LLM story. The rest is serving infrastructure. Here are the patterns I've used:

vLLM: The Production Standard

vLLM has become the de facto serving engine for most LLM deployments. Its key innovations: PagedAttention, which manages the KV-cache in fixed-size blocks like virtual memory pages and nearly eliminates memory fragmentation, and continuous batching, which lets new requests join a running batch as earlier requests finish instead of waiting for the whole batch to drain.

Speculative Decoding

Autoregressive generation is inherently sequential — you can't generate token N+1 without token N. Speculative decoding works around this by using a small "draft" model to generate K candidate tokens quickly, then verifying them in parallel with the large model.

When the draft model's guesses are correct (which happens 60-80% of the time for well-matched models), you get K tokens for the cost of one large-model forward pass. In practice, this yields a 2-3x speedup in decoding throughput (tokens per second); it does not help time-to-first-token, which is dominated by prefill.
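A simplified greedy version of the propose/verify loop. The toy deterministic "models" below are stand-ins for real draft and target networks; production implementations accept/reject probabilistically and batch the target's verification pass.

```python
def speculative_decode(prompt, draft, target, k=4, max_tokens=8):
    # Greedy sketch: draft proposes k tokens, target checks them;
    # the matching prefix is accepted, then one corrected token is
    # taken from the target.
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_tokens:
        proposed, ctx = [], list(tokens)
        for _ in range(k):                       # cheap draft pass, k steps
            t = draft(ctx)
            proposed.append(t)
            ctx.append(t)
        accepted = 0
        for i, t in enumerate(proposed):         # verify with the target model
            if target(tokens + proposed[:i]) == t:
                accepted += 1
            else:
                break
        tokens += proposed[:accepted]
        if accepted < len(proposed):
            tokens.append(target(tokens))        # target's correction is free
    return tokens[len(prompt):len(prompt) + max_tokens]

# Toy "models": target emits a fixed cycle; draft agrees except every 3rd token.
cycle = ["a", "b", "c"]
target = lambda ctx: cycle[len(ctx) % 3]
draft = lambda ctx: cycle[len(ctx) % 3] if len(ctx) % 3 != 2 else "x"
completion = speculative_decode(["a"], draft, target, k=4, max_tokens=6)
print(completion)   # ['b', 'c', 'a', 'b', 'c', 'a']
```

Even with the draft wrong every third token, each loop iteration yields 2-3 tokens per target "pass", which is where the speedup comes from.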

Model Routing

Not every request needs GPT-4. A production LLM system should route requests to the right model:

Request Router Architecture:
  User Query
    → Complexity Classifier (lightweight model)
      → Simple: Route to 7B fine-tuned model (~$0.001/request)
      → Medium: Route to 70B model (~$0.01/request)
      → Complex: Route to GPT-4 / Claude (~$0.05/request)

This saved us ~60% on API costs at EPAM. The classifier itself is a fine-tuned DistilBERT — sub-millisecond inference, trivial cost.
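The routing pattern sketched in Python. The classifier here is a hypothetical word-count heuristic standing in for the fine-tuned model, and the model names and per-request costs are illustrative:

```python
# Illustrative cost table; real numbers depend on provider and context length.
COST_PER_REQUEST = {"small-7b": 0.001, "medium-70b": 0.01, "frontier": 0.05}

def classify_complexity(query: str) -> str:
    # Hypothetical stand-in for a fine-tuned DistilBERT classifier:
    # here, longer queries are assumed to be more complex.
    words = len(query.split())
    if words < 10:
        return "simple"
    if words < 40:
        return "medium"
    return "complex"

def route(query: str) -> str:
    return {"simple": "small-7b",
            "medium": "medium-70b",
            "complex": "frontier"}[classify_complexity(query)]

model = route("reset my password")
print(model, COST_PER_REQUEST[model])   # small-7b 0.001
```

The interesting engineering is in the classifier's training data: you label historical queries with which model tier answered them acceptably, then train the router on that.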

RAG: Grounding LLMs in Your Data

Retrieval-Augmented Generation is how you make LLMs useful for domain-specific tasks without retraining. The architecture:

  1. Indexing: Chunk your documents, embed them with a model like BGE or E5, store in a vector database (ChromaDB, Pinecone, Weaviate).
  2. Retrieval: At query time, embed the user's question, find the K most similar chunks via approximate nearest neighbor search.
  3. Generation: Concatenate retrieved chunks into the prompt context and let the LLM generate a grounded answer.

The critical engineering decisions: chunk size (too small loses context, too large dilutes relevance), embedding model choice (domain-specific vs general-purpose), and reranking (a cross-encoder reranker after initial retrieval dramatically improves precision).
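A sliding-window chunker implementing the size/overlap trade-off described above; integer token IDs stand in for a real tokenizer's output:

```python
def chunk_tokens(tokens, chunk_size=512, overlap=50):
    # Sliding-window chunking: each chunk shares `overlap` tokens with the
    # previous one, so sentences spanning a boundary appear in both chunks.
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

tokens = list(range(1200))               # stand-in for a tokenized document
chunks = chunk_tokens(tokens)
print(len(chunks), len(chunks[0]), chunks[1][0])   # 3 512 462
```

Tuning `chunk_size` and `overlap` per corpus is usually worth an evaluation sweep; the right values depend heavily on how self-contained your documents' paragraphs are.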

"For our IAM policy generator, we found that 512-token chunks with 50-token overlap and a BGE-reranker gave the best policy accuracy. Larger chunks included too much irrelevant AWS documentation; smaller chunks lost the context needed for multi-service policies."

Fine-Tuning: When RAG Isn't Enough

RAG handles factual grounding, but sometimes you need the model to learn a new behavior — a specific output format, domain jargon, or reasoning pattern. That's where fine-tuning comes in.

LoRA Configuration (what worked for us):
  rank: 16-64 (higher = more capacity, more memory)
  alpha: 32 (scaling factor, typically 2x rank)
  target_modules: ["q_proj", "v_proj", "k_proj", "o_proj"]
  learning_rate: 2e-4
  epochs: 3-5 (watch for overfitting on small datasets)
  
Dataset size guidelines:
  - Instruction tuning: 1K-10K examples
  - Domain adaptation: 10K-100K examples
  - Style/format tuning: 500-2K examples
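The LoRA idea in a few lines of NumPy: the pretrained weight stays frozen and only two small low-rank factors train. The rank and alpha names match the config above; zero-initializing the up-projection makes the adapter a no-op at the start of training, as in the original LoRA recipe.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha, rank):
    # Effective weight is W + (alpha / rank) * (A @ B), but we never
    # materialize the full-size delta: x @ A @ B uses only rank-sized matmuls.
    return x @ W + (alpha / rank) * (x @ A) @ B

d, rank, alpha = 64, 16, 32
rng = np.random.default_rng(4)
W = rng.normal(size=(d, d))                  # frozen pretrained weight
A = rng.normal(size=(d, rank)) * 0.01        # trainable down-projection
B = np.zeros((rank, d))                      # trainable up-projection, zero-init
x = rng.normal(size=(2, d))
out = lora_forward(x, W, A, B, alpha, rank)
print(np.allclose(out, x @ W))               # True: zero-init B means no change yet
```

The memory win: instead of 64x64 trainable parameters per matrix you train 2x(64x16), and for a 70B model the same ratio means adapters of a few hundred MB instead of 140 GB of gradients and optimizer state.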

At Amazon, we fine-tuned BERT (a smaller transformer) for employee sentiment analysis. The key lesson: data quality matters more than data quantity. 2000 carefully curated, domain-specific examples outperformed 20000 noisy examples from a general sentiment dataset.

The Production Checklist

After deploying LLMs across multiple systems, here's the checklist I run through before any production launch:

  1. Guardrails: Input validation, output filtering, PII detection. LLMs will generate harmful content if you don't constrain them. Use tools like NeMo Guardrails or build custom classifiers.
  2. Rate limiting: LLM inference is expensive. Rate limit per user, per API key, and set hard cost ceilings.
  3. Fallbacks: When the LLM API is down (and it will be), what happens? Have a cached response, a simpler model, or a graceful degradation path.
  4. Observability: Log every request/response (redacted for PII), track latency percentiles, monitor token usage, alert on quality regression.
  5. Evaluation: Automated eval suites that run on every model update. Human eval for edge cases. Never deploy a model you haven't evaluated systematically.
  6. Cost tracking: Token-level cost attribution per feature/user. It's easy to burn $10K/month on LLM APIs without realizing it.

Conclusion

Understanding LLM architecture isn't just academic — it directly informs how you deploy, optimize, and debug these systems in production. Knowing why GQA reduces memory tells you when to trade off quality for throughput. Understanding KV-cache tells you why long prompts are expensive. Knowing RoPE tells you why context extension works.

The engineers who understand the internals don't just use LLMs — they build systems that use LLMs efficiently, reliably, and cost-effectively. That's the difference between a demo and a product.

Next up: Building Recommendation Systems That Actually Scale — from matrix factorization to deep learning at restaurant scale.