Building Blocks of LLMs: A Comprehensive Guide

A comprehensive technical visual guide for LLMOps professionals. Explore the mathematical DNA and architectural anatomy of modern AI.

0 Recap: The Foundation

Before the “brain” can process information, it must translate human symbols into machine vectors.

Analogy: Like breaking a LEGO castle into individual bricks. We use algorithms like BPE or WordPiece to make sure even rare words (like “unbelievable”) are broken into pieces.
Example: “unbelievable” → un | be | liev | able

1. BPE Tokenization
2. Vector Embeddings
3. Positional Encoding

The Mathematical Result

By the end, text becomes a matrix $X \in \mathbb{R}^{n \times d_{model}}$, ready for processing.
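
For concreteness, here is a minimal NumPy sketch of these three steps; the four-token vocabulary, the random embedding table, and $d_{model} = 8$ are illustrative assumptions (real models use a trained BPE vocabulary, learned embeddings, and far larger dimensions).

```python
import numpy as np

# Toy settings (illustrative only).
vocab = {"un": 0, "be": 1, "liev": 2, "able": 3}
d_model = 8

# 1. BPE Tokenization: map sub-word pieces to integer IDs.
tokens = ["un", "be", "liev", "able"]
ids = np.array([vocab[t] for t in tokens])            # shape (n,)

# 2. Vector Embeddings: a lookup table of shape (vocab_size, d_model).
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))
X = embedding_table[ids]                              # shape (n, d_model)

# 3. Positional Encoding: the original sinusoidal recipe.
pos = np.arange(len(ids))[:, None]                    # (n, 1)
i = np.arange(d_model // 2)[None, :]                  # (1, d_model / 2)
angles = pos / (10000 ** (2 * i / d_model))
pe = np.zeros((len(ids), d_model))
pe[:, 0::2] = np.sin(angles)
pe[:, 1::2] = np.cos(angles)

X = X + pe                                            # the matrix in R^{n x d_model}
print(X.shape)                                        # (4, 8)
```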

1 The Attention Mechanism

The “Library Search” Analogy: Imagine searching for “Jaguar”. Query ($Q$) is your intent (Animal or Car?). Key ($K$) is the book spine label. Value ($V$) is the wisdom inside.
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
4. Matrix Projections: the input $X$ is projected into the $Q$, $K$, and $V$ matrices.
5. Attention Scores: the softmax weights highlight the high-weight (most relevant) token pairs.

2 Dummy Example

// Token 1 (“Bank”)
Q1=[1,0], K1=[1,0], V1=[10,0]

// Token 2 (“River”)
Q2=[1,1], K2=[0,1], V2=[0,10]

The calculation resolves ambiguity. For “Bank”, the raw scores are Q1·K1 = 1 and Q1·K2 = 0; after scaling by √d_k = √2 and applying softmax, the weights are roughly 0.67 and 0.33. The output for “Bank” therefore absorbs 33% of the “River” context, shifting its meaning in high-dimensional space.

Result: [6.7, 3.3] (River-Bank Context)
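
To verify this, here is a minimal NumPy sketch of the attention formula applied to the toy vectors above; the function name and the stacking of the two tokens into 2×2 matrices are just presentation choices.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # pairwise similarity
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V, weights

# Token 1 ("Bank") and Token 2 ("River") from the dummy example above.
Q = np.array([[1.0, 0.0], [1.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0]])
V = np.array([[10.0, 0.0], [0.0, 10.0]])

out, w = attention(Q, K, V)
print(w[0])    # ~[0.67, 0.33] -> "Bank" attends 33% to "River"
print(out[0])  # ~[6.7, 3.3]
```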

4 Rotary Positional Encoding (RoPE)

The Clock Analogy: Instead of adding position numbers to the embeddings, RoPE “rotates” the query and key vectors; relative distance becomes the angle θ between them.

6. Complex Rotation (RoPE): each position advances the vector by a fixed angle, like a hand sweeping around a clock face.
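
A minimal sketch of the rotation idea, assuming a single toy frequency θ = 0.1 and a 2-dimensional head (real RoPE splits each head into many 2-D pairs, each with its own frequency): if the query is rotated by its position and the key by its position, their dot product depends only on how far apart they are.

```python
import numpy as np

def rotate(vec, pos, theta=0.1):
    """Rotate a 2-D vector by pos * theta radians (one RoPE frequency pair)."""
    angle = pos * theta
    rot = np.array([[np.cos(angle), -np.sin(angle)],
                    [np.sin(angle),  np.cos(angle)]])
    return rot @ vec

q = np.array([1.0, 0.0])
k = np.array([0.5, 0.5])

# Same relative distance (3 positions apart) at two different absolute offsets:
s1 = rotate(q, pos=5) @ rotate(k, pos=2)
s2 = rotate(q, pos=105) @ rotate(k, pos=102)
print(np.isclose(s1, s2))   # True -> the attention score encodes relative position
```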

6 The Transformer Block

The fundamental repeating unit of an LLM.

7. Layer Architecture: Multi-Head Attention → LayerNorm + Add → Feed Forward (FFN) → LayerNorm + Add
  • Attention: The “social” communication layer.
  • FFN: The “knowledge” storage layer (Facts).
  • Layer Norm: The statistical “stabilizer”.
  • Residuals: The “Express Lane” for signals.
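
Putting the four ingredients together, here is a minimal PyTorch sketch of one block, assuming the post-norm layout shown above; the dimensions and the use of nn.MultiheadAttention are illustrative choices, and many modern LLMs use a pre-norm variant with causal masking.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Post-norm block: Attention -> LayerNorm + Add -> FFN -> LayerNorm + Add."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(                     # the "knowledge" layer
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)            # the "stabilizers"
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)              # "social" communication
        x = self.norm1(x + attn_out)                  # residual "express lane"
        x = self.norm2(x + self.ffn(x))
        return x

x = torch.randn(1, 16, 512)                           # (batch, seq_len, d_model)
print(TransformerBlock()(x).shape)                    # torch.Size([1, 16, 512])
```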

7 Mixture of Experts (MoE)

8. Sparse Gating: Router → Expert A / Expert B
The “Hospital” Analogy: In a Dense model, you see every doctor. In MoE, a Router (Triage Nurse) sends a “Physics” token only to the Physics Experts.

Allows massive models (like Mixtral) to run at the speed of small ones by only activating specific parameters per token.
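
A minimal PyTorch sketch of the gating idea, assuming four tiny MLP experts with top-2 routing and no load-balancing loss (Mixtral-style layers use 8 experts, top-2 routing, and auxiliary balancing terms); only the selected experts run for each token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Sparse gating: a router sends each token to its top-k experts only."""

    def __init__(self, d_model=64, n_experts=4, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)    # the "triage nurse"
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                              # x: (n_tokens, d_model)
        gate_logits = self.router(x)                   # (n_tokens, n_experts)
        gate_vals, idx = gate_logits.topk(self.top_k, dim=-1)
        gate_vals = F.softmax(gate_vals, dim=-1)       # renormalise the top-k gates
        out = torch.zeros_like(x)
        for k in range(self.top_k):                    # only the chosen experts run
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += gate_vals[mask, k:k+1] * expert(x[mask])
        return out

tokens = torch.randn(10, 64)
print(MoELayer()(tokens).shape)                        # torch.Size([10, 64])
```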

L Logits & Decoding Strategies

After the layers process the vector, we have to pick a word. The model outputs “Logits” (raw scores), which we turn into probabilities via Softmax.

9. The Logit Pipeline: Word Scores (Logits) → Softmax → Probabilities

Decoding Comparison Table

Strategy | Mechanism | Best For… | The “Vibe”
Greedy | Pick the #1 highest-probability token | Math, Coding, Facts | Deterministic / Rigid
Top-K | Sample from the Top K words | General Chat | Balanced
Nucleus (Top-p) | Sample from the top-p share of the probability mass | Creative Writing | Dynamic / Human-like
Beam Search | Track multiple candidate paths | Translation, Summaries | Thorough / Optimized
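
The first three strategies can be sketched in a few lines of NumPy; the five-word vocabulary and the logits are made up for illustration, and Beam Search is omitted because it needs the full model loop to score whole sequences.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]           # toy vocabulary (illustrative)
logits = np.array([2.0, 1.5, 0.3, -0.5, -1.0])       # made-up raw scores

probs = np.exp(logits) / np.exp(logits).sum()        # softmax -> probabilities

# Greedy: always take the single highest-probability token.
greedy = vocab[int(np.argmax(probs))]

# Top-K: sample only among the K most likely tokens.
def top_k(probs, k=3):
    idx = np.argsort(probs)[-k:]
    p = probs[idx] / probs[idx].sum()
    return vocab[rng.choice(idx, p=p)]

# Nucleus (Top-p): sample from the smallest set covering >= p of the mass.
def top_p(probs, p=0.9):
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, p)) + 1
    keep = order[:cutoff]
    q = probs[keep] / probs[keep].sum()
    return vocab[rng.choice(keep, p=q)]

print(greedy, top_k(probs), top_p(probs))
```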

M Hardware-Software Mapping

Connecting architectural needs to production infrastructure.

Constraint | Architecture Block | Hardware Need | Software Solution
VRAM (Capacity) | Model Weights / MoE | A100/H100 (80 GB+) | Quantization (FP8 / 4-bit)
Compute (FLOPs) | Dense Attention layers | Tensor Cores (NVIDIA) | FlashAttention-3
Memory Bandwidth | Token Generation (TPOT) | HBM3e Memory | Speculative Decoding
Fragmentation | KV Cache Storage | GPU HBM (VRAM) | PagedAttention (vLLM)

8 The Training Lifecycle

1. Pretraining (Foundation)

Learning “how” to speak by reading the internet. Massive scale, self-supervised.

2. SFT (Fine-Tuning)

Instruction following. Learning to behave like an assistant with curated data.

3. RLHF / DPO (Alignment)

Fine-tuning for human preference. Safety, style, and helpfulness filters.

10. Summary for LLMOps

VRAM Usage

Divided between Weights (long-term) and KV Cache (short-term memory). Manage with PagedAttention.
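
As a back-of-envelope sketch of that split, assume a hypothetical 7B-parameter model in FP16 with 32 layers, 32 KV heads of dimension 128, a 4,096-token context, and 8 concurrent sequences (all of these numbers are assumptions for illustration; real deployments vary widely).

```python
# Back-of-envelope VRAM split (all numbers are illustrative assumptions).
params          = 7e9          # hypothetical 7B-parameter model
bytes_per_param = 2            # FP16; FP8 / 4-bit quantization shrinks this
weights_gb = params * bytes_per_param / 1e9

n_layers, n_kv_heads, head_dim = 32, 32, 128
bytes_per_elem   = 2           # FP16 KV cache
seq_len, batch   = 4096, 8
# 2x for K and V, per layer, per head, per position, per sequence in the batch.
kv_cache_gb = (2 * n_layers * n_kv_heads * head_dim
               * seq_len * batch * bytes_per_elem) / 1e9

print(f"Weights  : {weights_gb:.1f} GB (long-term)")    # ~14 GB
print(f"KV cache : {kv_cache_gb:.1f} GB (short-term)")  # ~17 GB at this batch/seq
```

At high batch sizes and long contexts the KV cache, not the weights, often dominates, which is why PagedAttention-style management matters.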

Inference Latency

TTFT: Time to first token (Reading prompt).
TPS: Tokens per second (Generating response).

Optimization

Continuous Batching and Quantization (INT8/FP8) are essential for production scale.
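
As a minimal sketch of the quantization half, here is symmetric per-tensor INT8 weight quantization (real systems typically quantize per-channel or per-group and treat activations separately).

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: w ~= scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)
q, scale = quantize_int8(w)
error = np.abs(w - dequantize(q, scale)).max()
print(q.dtype, f"max abs error = {error:.4f}")   # int8, small reconstruction error
```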

