Building Blocks of LLMs
A comprehensive technical visual guide for LLMOps professionals. Explore the mathematical DNA and architectural anatomy of modern AI.
0 Recap: The Foundation
Before the “brain” can process information, it must translate human symbols into machine vectors.
The Mathematical Result
By the end, the text becomes a matrix $X \in \mathbb{R}^{n \times d_{model}}$, where $n$ is the number of tokens and $d_{model}$ is the embedding dimension, ready for processing.
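A minimal sketch of that pipeline, assuming a toy whitespace tokenizer and a made-up five-word vocabulary (real models learn subword vocabularies of tens of thousands of tokens and use $d_{model}$ in the thousands):

```python
import numpy as np

# Hypothetical toy vocabulary and dimensions, for illustration only.
vocab = {"<unk>": 0, "the": 1, "bank": 2, "of": 3, "river": 4}
d_model = 8

rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))  # learned weights in a real model

def embed(text: str) -> np.ndarray:
    """Tokenize (naive whitespace split) and look up one embedding row per token."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in text.lower().split()]
    return embedding_table[ids]  # shape: (n, d_model)

X = embed("The bank of the river")
print(X.shape)  # (5, 8) -> the matrix X in R^{n x d_model}
```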
1 The Attention Mechanism
2 Dummy Example
// Token 1 (“Bank”)
Q1=[1,0], K1=[1,0], V1=[10,0]
// Token 2 (“River”)
Q2=[1,1], K2=[0,1], V2=[0,10]
The calculation, $\text{softmax}(QK^\top / \sqrt{d_k})\,V$, resolves the ambiguity: the output for “Bank” absorbs roughly 33% of the “River” value vector, shifting its meaning in high-dimensional space.
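A minimal NumPy sketch of that calculation with the toy vectors above (the two-dimensional values come from the dummy example, not from a real model):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy vectors from the example above.
Q = np.array([[1.0, 0.0],   # "Bank"
              [1.0, 1.0]])  # "River"
K = np.array([[1.0, 0.0],
              [0.0, 1.0]])
V = np.array([[10.0, 0.0],
              [0.0, 10.0]])

d_k = Q.shape[-1]
scores = Q @ K.T / np.sqrt(d_k)     # raw compatibility scores
weights = softmax(scores, axis=-1)  # attention weights, one row per query
output = weights @ V                # context-mixed value vectors

print(weights[0].round(2))  # [0.67 0.33]: "Bank" puts ~33% of its attention on "River"
print(output[0].round(2))   # [6.7 3.3]: the "Bank" output is pulled toward "River"
```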
4 Rotary Positional Embeddings (RoPE)
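A minimal sketch of the core idea, assuming the common interleaved pairing and the standard $10000^{-2i/d}$ frequency schedule: each (even, odd) pair of query/key dimensions is rotated by an angle proportional to the token's position, so query–key dot products depend only on relative distance.

```python
import numpy as np

def rope_rotate(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate the (even, odd) dimension pairs of x by position-dependent angles (simplified RoPE)."""
    d = x.shape[-1]
    assert d % 2 == 0, "RoPE needs an even head dimension"
    i = np.arange(d // 2)
    angles = pos * base ** (-2.0 * i / d)   # one frequency per dimension pair
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin  # standard 2-D rotation per pair
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=4), rng.normal(size=4)

# Key property: the score depends only on the relative offset (here, 2 in both cases).
a = rope_rotate(q, pos=3) @ rope_rotate(k, pos=1)
b = rope_rotate(q, pos=10) @ rope_rotate(k, pos=8)
print(np.isclose(a, b))  # True: attention scores encode relative position
```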
6 The Transformer Block
The fundamental repeating unit of an LLM.
- Attention: The “social” communication layer.
- FFN: The “knowledge” storage layer (Facts).
- Layer Norm: The statistical “stabilizer”.
- Residuals: The “Express Lane” for signals.
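A minimal single-head, pre-norm sketch of how those four pieces compose (random placeholder weights and tiny dimensions; real blocks use learned multi-head attention and far larger sizes):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    """Per-token normalization: the statistical stabilizer (no learned scale/shift here)."""
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def transformer_block(x, Wq, Wk, Wv, Wo, W1, W2):
    # 1) Attention: tokens communicate (pre-norm, single head for simplicity).
    h = layer_norm(x)
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    att = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
    x = x + att @ Wo                     # residual: the express lane around attention
    # 2) FFN: per-token knowledge lookup.
    h = layer_norm(x)
    x = x + np.maximum(0, h @ W1) @ W2   # residual again, around the FFN
    return x

# Toy shapes: n=5 tokens, d_model=8, FFN hidden size 32.
rng = np.random.default_rng(0)
n, d, d_ff = 5, 8, 32
X = rng.normal(size=(n, d))
Wq, Wk, Wv, Wo = [rng.normal(scale=0.1, size=(d, d)) for _ in range(4)]
W1, W2 = rng.normal(scale=0.1, size=(d, d_ff)), rng.normal(scale=0.1, size=(d_ff, d))

print(transformer_block(X, Wq, Wk, Wv, Wo, W1, W2).shape)  # (5, 8): same shape in, same shape out
```

Because each block maps an $(n, d_{model})$ matrix to the same shape, dozens of identical blocks can simply be stacked.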
7 Mixture of Experts (MoE)
Allows massive models (like Mixtral) to run at the speed of much smaller ones: a router sends each token to only a few expert FFNs, so only a fraction of the total parameters is active per token.
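A minimal sketch of top-k routing, the gating style used by Mixtral-class MoE layers (the gate and expert weights below are random placeholders):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_layer(x, W_gate, experts, top_k=2):
    """Route each token to its top_k experts; only those FFNs are evaluated."""
    logits = x @ W_gate                             # (n_tokens, n_experts) router scores
    top = np.argsort(logits, axis=-1)[:, -top_k:]   # chosen expert indices per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        gates = softmax(logits[t, top[t]])          # renormalize over the chosen experts
        for g, e in zip(gates, top[t]):
            W1, W2 = experts[e]
            out[t] += g * (np.maximum(0, x[t] @ W1) @ W2)
    return out                                      # only top_k of n_experts ran per token

rng = np.random.default_rng(0)
n_tokens, d, d_ff, n_experts = 4, 8, 16, 8
x = rng.normal(size=(n_tokens, d))
W_gate = rng.normal(scale=0.1, size=(d, n_experts))
experts = [(rng.normal(scale=0.1, size=(d, d_ff)),
            rng.normal(scale=0.1, size=(d_ff, d))) for _ in range(n_experts)]

print(moe_layer(x, W_gate, experts).shape)  # (4, 8): only 2 of 8 expert FFNs active per token
```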
L Logits & Decoding Strategies
After the layers have processed the hidden vector, the model must pick the next token. It outputs logits (raw scores over the vocabulary), which are turned into probabilities via Softmax.
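A tiny sketch of that step, using made-up logits over a hypothetical five-token vocabulary:

```python
import numpy as np

vocab = ["river", "money", "fish", "loan", "water"]   # toy vocabulary
logits = np.array([2.1, 3.5, 0.3, 2.8, 1.0])          # raw scores from the final layer

def softmax(x):
    e = np.exp(x - x.max())   # subtract the max for numerical stability
    return e / e.sum()

probs = softmax(logits)
print(dict(zip(vocab, probs.round(3))))               # probabilities summing to 1
print("greedy pick:", vocab[int(np.argmax(probs))])   # "money"
```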
Decoding Comparison Table
| Strategy | Mechanism | Best For… | The “Vibe” |
|---|---|---|---|
| Greedy | Always pick the single highest-probability token | Math, Coding, Facts | Deterministic / Rigid |
| Top-K | Sample from the K most likely tokens | General Chat | Balanced |
| Nucleus (Top-p) | Sample from the smallest set of tokens covering probability mass p | Creative Writing | Dynamic / Human-like |
| Beam Search | Track multiple candidate sequences, keep the best-scoring one | Translation, Summaries | Thorough / Optimized |
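Hedged sketches of the two sampling strategies from the table, reusing the toy logits above (the K and p cutoffs are illustrative, not recommendations):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def top_k_sample(logits, k, rng):
    """Keep the k highest-probability tokens, renormalize, sample."""
    probs = softmax(logits)
    keep = np.argsort(probs)[-k:]
    p = np.zeros_like(probs)
    p[keep] = probs[keep]
    return int(rng.choice(len(probs), p=p / p.sum()))

def top_p_sample(logits, p_cut, rng):
    """Keep the smallest set of tokens whose cumulative probability reaches p_cut."""
    probs = softmax(logits)
    order = np.argsort(probs)[::-1]               # tokens from most to least likely
    nucleus = order[: int(np.searchsorted(np.cumsum(probs[order]), p_cut)) + 1]
    p = np.zeros_like(probs)
    p[nucleus] = probs[nucleus]
    return int(rng.choice(len(probs), p=p / p.sum()))

rng = np.random.default_rng(0)
logits = np.array([2.1, 3.5, 0.3, 2.8, 1.0])
print("greedy:", int(np.argmax(logits)))                    # deterministic / rigid
print("top-k :", top_k_sample(logits, k=3, rng=rng))        # balanced
print("top-p :", top_p_sample(logits, p_cut=0.9, rng=rng))  # dynamic / human-like
```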
M Hardware-Software Mapping
Connecting architectural needs to production infrastructure.
| Constraint | Architecture Block | Hardware Need | Software Solution |
|---|---|---|---|
| VRAM (Capacity) | Model Weights / MoE | A100/H100 (80GB+) | Quantization (FP8/4-bit) |
| Compute (FLOPs) | Dense Attention layers | Tensor Cores (NVIDIA) | FlashAttention-3 |
| Memory Bandwidth | Token Generation (TPOT) | HBM3e Memory | Speculative Decoding |
| Fragmentation | KV Cache Storage | GPU VRAM (HBM) | PagedAttention (vLLM) |
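For the VRAM row, a quick back-of-the-envelope helper (parameter counts and formats are illustrative; this counts weights only and ignores activations and the KV cache):

```python
# Rough weight memory: parameters x bytes per parameter.
BYTES_PER_PARAM = {"FP16/BF16": 2.0, "FP8/INT8": 1.0, "4-bit": 0.5}

def weight_memory_gb(params_billion: float, fmt: str) -> float:
    return params_billion * 1e9 * BYTES_PER_PARAM[fmt] / 1e9  # gigabytes

for fmt in BYTES_PER_PARAM:
    print(f"70B weights @ {fmt:9}: ~{weight_memory_gb(70, fmt):.0f} GB")
# FP16 ~140 GB (two 80 GB GPUs), FP8 ~70 GB, 4-bit ~35 GB (fits one 80 GB GPU)
```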
8 The Training Lifecycle
1. Pretraining (Foundation)
Learning “how” to speak by reading the internet. Massive scale, self-supervised.
2. SFT (Supervised Fine-Tuning)
Instruction following. Learning to behave like an assistant with curated data.
3. RLHF / DPO (Alignment)
Fine-tuning toward human preferences: safety, style, and helpfulness.
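For reference, a compact (and simplified) view of the objectives behind these stages: pretraining and SFT both minimize next-token cross-entropy, while DPO pushes the model toward preferred responses relative to a frozen reference policy.

$$\mathcal{L}_{\text{LM}} = -\sum_{t} \log p_\theta(x_t \mid x_{<t}), \qquad \mathcal{L}_{\text{DPO}} = -\,\mathbb{E}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

where $y_w$ and $y_l$ are the preferred and rejected responses in a comparison pair and $\pi_{\text{ref}}$ is the frozen SFT model.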
10 Summary for LLMOps
VRAM Usage
Split between model weights (long-term, static) and the KV Cache (short-term memory that grows with batch size and context length). Manage fragmentation with PagedAttention.
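A hedged sizing helper for that split (assumes standard multi-head attention and an FP16 cache; GQA/MQA models use fewer KV heads and shrink this considerably):

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_value=2):
    """KV cache = 2 (K and V) x layers x KV heads x head_dim x tokens x batch x bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value / 1e9

# Illustrative 7B-class config: 32 layers, 32 KV heads, head_dim 128, 4k context, batch 8.
print(f"{kv_cache_gb(32, 32, 128, seq_len=4096, batch=8):.1f} GB")  # ~17 GB of cache alone
```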
Inference Latency
TTFT: Time to first token (reading the prompt; compute-bound prefill).
TPS: Tokens per second (generating the response; memory-bandwidth-bound decode).
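A rough roofline sketch of why generation speed is bandwidth-bound: every decoded token must stream the active weights from HBM, so bandwidth divided by bytes moved per token gives an upper bound on single-stream TPS (the model size and bandwidth below are illustrative):

```python
def max_tokens_per_second(active_params_billion, bytes_per_param, hbm_bandwidth_gb_s):
    """Upper bound on single-stream decode speed: bandwidth / bytes read per token."""
    bytes_per_token = active_params_billion * 1e9 * bytes_per_param
    return hbm_bandwidth_gb_s * 1e9 / bytes_per_token

# Illustrative: a 7B dense model in FP16 on a GPU with ~3,350 GB/s of HBM bandwidth.
print(f"~{max_tokens_per_second(7, 2, 3350):.0f} tokens/s upper bound")  # ~239 tokens/s
```

Batching raises aggregate throughput because the same weight traffic serves many requests at once, which is why continuous batching matters below.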
Optimization
Continuous Batching and Quantization (INT8/FP8) are essential for production scale.