Building Blocks of LLMs
A comprehensive technical visual guide for LLMOps professionals. Explore the mathematical DNA and architectural anatomy of modern AI.
0 Recap: The Foundation
Before the “brain” can process information, it must translate human symbols into machine vectors.
The Mathematical Result
By the end, the text becomes a matrix $X \in \mathbb{R}^{n \times d_{model}}$, where $n$ is the number of tokens and $d_{model}$ is the embedding dimension, ready for processing.
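A minimal sketch of that pipeline, assuming a toy whitespace tokenizer and a made-up five-word vocabulary (real models learn subword vocabularies of tens of thousands of tokens and use $d_{model}$ in the thousands):

```python
import numpy as np

# Hypothetical toy vocabulary and dimensions, for illustration only.
vocab = {"<unk>": 0, "the": 1, "bank": 2, "of": 3, "river": 4}
d_model = 8

rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))  # learned weights in a real model

def embed(text: str) -> np.ndarray:
    """Tokenize (naive whitespace split) and look up one embedding row per token."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in text.lower().split()]
    return embedding_table[ids]  # shape: (n, d_model)

X = embed("The bank of the river")
print(X.shape)  # (5, 8) -> the matrix X in R^{n x d_model}
```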
1 The Attention Mechanism
2 Dummy Example
// Token 1 (“Bank”)
Q1=[1,0], K1=[1,0], V1=[10,0]
// Token 2 (“River”)
Q2=[1,1], K2=[0,1], V2=[0,10]
The calculation, $\text{softmax}(QK^\top / \sqrt{d_k})\,V$, resolves the ambiguity: the output for “Bank” absorbs roughly 33% of the “River” value vector, shifting its meaning in high-dimensional space.
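A minimal NumPy sketch of that calculation with the toy vectors above (the two-dimensional values come from the dummy example, not from a real model):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy vectors from the example above.
Q = np.array([[1.0, 0.0],   # "Bank"
              [1.0, 1.0]])  # "River"
K = np.array([[1.0, 0.0],
              [0.0, 1.0]])
V = np.array([[10.0, 0.0],
              [0.0, 10.0]])

d_k = Q.shape[-1]
scores = Q @ K.T / np.sqrt(d_k)     # raw compatibility scores
weights = softmax(scores, axis=-1)  # attention weights, one row per query
output = weights @ V                # context-mixed value vectors

print(weights[0].round(2))  # [0.67 0.33]: "Bank" puts ~33% of its attention on "River"
print(output[0].round(2))   # [6.7 3.3]: the "Bank" output is pulled toward "River"
```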
4 Rotary Positional Embeddings (RoPE)
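A minimal sketch of the core idea, assuming the common interleaved pairing and the standard $10000^{-2i/d}$ frequency schedule: each (even, odd) pair of query/key dimensions is rotated by an angle proportional to the token's position, so query–key dot products depend only on relative distance.

```python
import numpy as np

def rope_rotate(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate the (even, odd) dimension pairs of x by position-dependent angles (simplified RoPE)."""
    d = x.shape[-1]
    assert d % 2 == 0, "RoPE needs an even head dimension"
    i = np.arange(d // 2)
    angles = pos * base ** (-2.0 * i / d)   # one frequency per dimension pair
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin  # standard 2-D rotation per pair
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=4), rng.normal(size=4)

# Key property: the score depends only on the relative offset (here, 2 in both cases).
a = rope_rotate(q, pos=3) @ rope_rotate(k, pos=1)
b = rope_rotate(q, pos=10) @ rope_rotate(k, pos=8)
print(np.isclose(a, b))  # True: attention scores encode relative position
```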
6 The Transformer Block
The fundamental repeating unit of an LLM.
- Attention: The “social” communication layer.
- FFN: The “knowledge” storage layer (Facts).
- Layer Norm: The statistical “stabilizer”.
- Residuals: The “Express Lane” for signals.
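A minimal single-head, pre-norm sketch of how those four pieces compose (random placeholder weights and tiny dimensions; real blocks use learned multi-head attention and far larger sizes):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    """Per-token normalization: the statistical stabilizer (no learned scale/shift here)."""
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def transformer_block(x, Wq, Wk, Wv, Wo, W1, W2):
    # 1) Attention: tokens communicate (pre-norm, single head for simplicity).
    h = layer_norm(x)
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    att = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
    x = x + att @ Wo                     # residual: the express lane around attention
    # 2) FFN: per-token knowledge lookup.
    h = layer_norm(x)
    x = x + np.maximum(0, h @ W1) @ W2   # residual again, around the FFN
    return x

# Toy shapes: n=5 tokens, d_model=8, FFN hidden size 32.
rng = np.random.default_rng(0)
n, d, d_ff = 5, 8, 32
X = rng.normal(size=(n, d))
Wq, Wk, Wv, Wo = [rng.normal(scale=0.1, size=(d, d)) for _ in range(4)]
W1, W2 = rng.normal(scale=0.1, size=(d, d_ff)), rng.normal(scale=0.1, size=(d_ff, d))

print(transformer_block(X, Wq, Wk, Wv, Wo, W1, W2).shape)  # (5, 8): same shape in, same shape out
```

Because each block maps an $(n, d_{model})$ matrix to the same shape, dozens of identical blocks can simply be stacked.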
7 Mixture of Experts (MoE)
Allows massive models (like Mixtral) to run at the speed of much smaller ones: a router sends each token to only a few expert FFNs, so only a fraction of the total parameters is active per token.
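A minimal sketch of top-k routing, the gating style used by Mixtral-class MoE layers (the gate and expert weights below are random placeholders):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_layer(x, W_gate, experts, top_k=2):
    """Route each token to its top_k experts; only those FFNs are evaluated."""
    logits = x @ W_gate                             # (n_tokens, n_experts) router scores
    top = np.argsort(logits, axis=-1)[:, -top_k:]   # chosen expert indices per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        gates = softmax(logits[t, top[t]])          # renormalize over the chosen experts
        for g, e in zip(gates, top[t]):
            W1, W2 = experts[e]
            out[t] += g * (np.maximum(0, x[t] @ W1) @ W2)
    return out                                      # only top_k of n_experts ran per token

rng = np.random.default_rng(0)
n_tokens, d, d_ff, n_experts = 4, 8, 16, 8
x = rng.normal(size=(n_tokens, d))
W_gate = rng.normal(scale=0.1, size=(d, n_experts))
experts = [(rng.normal(scale=0.1, size=(d, d_ff)),
            rng.normal(scale=0.1, size=(d_ff, d))) for _ in range(n_experts)]

print(moe_layer(x, W_gate, experts).shape)  # (4, 8): only 2 of 8 expert FFNs active per token
```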
L Logits & Decoding Strategies
After the layers have processed the hidden vector, the model must pick the next token. It outputs logits (raw scores over the vocabulary), which are turned into probabilities via Softmax.
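A tiny sketch of that step, using made-up logits over a hypothetical five-token vocabulary:

```python
import numpy as np

vocab = ["river", "money", "fish", "loan", "water"]   # toy vocabulary
logits = np.array([2.1, 3.5, 0.3, 2.8, 1.0])          # raw scores from the final layer

def softmax(x):
    e = np.exp(x - x.max())   # subtract the max for numerical stability
    return e / e.sum()

probs = softmax(logits)
print(dict(zip(vocab, probs.round(3))))               # probabilities summing to 1
print("greedy pick:", vocab[int(np.argmax(probs))])   # "money"
```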
Decoding Comparison Table
| Strategy | Mechanism | Best For… | The “Vibe” |
|---|---|---|---|
| Greedy | Always pick the single highest-probability token | Math, Coding, Facts | Deterministic / Rigid |
| Top-K | Sample from the K most likely tokens | General Chat | Balanced |
| Nucleus (Top-p) | Sample from the smallest set of tokens covering probability mass p | Creative Writing | Dynamic / Human-like |
| Beam Search | Track multiple candidate sequences, keep the best-scoring one | Translation, Summaries | Thorough / Optimized |
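Hedged sketches of the two sampling strategies from the table, reusing the toy logits above (the K and p cutoffs are illustrative, not recommendations):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def top_k_sample(logits, k, rng):
    """Keep the k highest-probability tokens, renormalize, sample."""
    probs = softmax(logits)
    keep = np.argsort(probs)[-k:]
    p = np.zeros_like(probs)
    p[keep] = probs[keep]
    return int(rng.choice(len(probs), p=p / p.sum()))

def top_p_sample(logits, p_cut, rng):
    """Keep the smallest set of tokens whose cumulative probability reaches p_cut."""
    probs = softmax(logits)
    order = np.argsort(probs)[::-1]               # tokens from most to least likely
    nucleus = order[: int(np.searchsorted(np.cumsum(probs[order]), p_cut)) + 1]
    p = np.zeros_like(probs)
    p[nucleus] = probs[nucleus]
    return int(rng.choice(len(probs), p=p / p.sum()))

rng = np.random.default_rng(0)
logits = np.array([2.1, 3.5, 0.3, 2.8, 1.0])
print("greedy:", int(np.argmax(logits)))                    # deterministic / rigid
print("top-k :", top_k_sample(logits, k=3, rng=rng))        # balanced
print("top-p :", top_p_sample(logits, p_cut=0.9, rng=rng))  # dynamic / human-like
```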
M Hardware-Software Mapping
Connecting architectural needs to production infrastructure.
| Constraint | Architecture Block | Hardware Need | Software Solution |
|---|---|---|---|
| VRAM (Capacity) | Model Weights / MoE | A100/H100 (80GB+) | Quantization (FP8/4-bit) |
| Compute (FLOPs) | Dense Attention layers | Tensor Cores (NVIDIA) | FlashAttention-3 |
| Memory Bandwidth | Token Generation (TPOT) | HBM3e Memory | Speculative Decoding |
| Fragmentation | KV Cache Storage | GPU VRAM (HBM) | PagedAttention (vLLM) |
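For the VRAM row, a quick back-of-the-envelope helper (parameter counts and formats are illustrative; this counts weights only and ignores activations and the KV cache):

```python
# Rough weight memory: parameters x bytes per parameter.
BYTES_PER_PARAM = {"FP16/BF16": 2.0, "FP8/INT8": 1.0, "4-bit": 0.5}

def weight_memory_gb(params_billion: float, fmt: str) -> float:
    return params_billion * 1e9 * BYTES_PER_PARAM[fmt] / 1e9  # gigabytes

for fmt in BYTES_PER_PARAM:
    print(f"70B weights @ {fmt:9}: ~{weight_memory_gb(70, fmt):.0f} GB")
# FP16 ~140 GB (two 80 GB GPUs), FP8 ~70 GB, 4-bit ~35 GB (fits one 80 GB GPU)
```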
8 The Training Lifecycle
1. Pretraining (Foundation)
Learning “how” to speak by reading the internet. Massive scale, self-supervised.
2. SFT (Supervised Fine-Tuning)
Instruction following. Learning to behave like an assistant with curated data.
3. RLHF / DPO (Alignment)
Fine-tuning toward human preferences: safety, style, and helpfulness.
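For reference, a compact (and simplified) view of the objectives behind these stages: pretraining and SFT both minimize next-token cross-entropy, while DPO pushes the model toward preferred responses relative to a frozen reference policy.

$$\mathcal{L}_{\text{LM}} = -\sum_{t} \log p_\theta(x_t \mid x_{<t}), \qquad \mathcal{L}_{\text{DPO}} = -\,\mathbb{E}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

where $y_w$ and $y_l$ are the preferred and rejected responses in a comparison pair and $\pi_{\text{ref}}$ is the frozen SFT model.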
10 Summary for LLMOps
VRAM Usage
Split between model weights (long-term, static) and the KV Cache (short-term memory that grows with batch size and context length). Manage fragmentation with PagedAttention.
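A hedged sizing helper for that split (assumes standard multi-head attention and an FP16 cache; GQA/MQA models use fewer KV heads and shrink this considerably):

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_value=2):
    """KV cache = 2 (K and V) x layers x KV heads x head_dim x tokens x batch x bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value / 1e9

# Illustrative 7B-class config: 32 layers, 32 KV heads, head_dim 128, 4k context, batch 8.
print(f"{kv_cache_gb(32, 32, 128, seq_len=4096, batch=8):.1f} GB")  # ~17 GB of cache alone
```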
Inference Latency
TTFT: Time to first token (reading the prompt; compute-bound prefill).
TPS: Tokens per second (generating the response; memory-bandwidth-bound decode).
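A rough roofline sketch of why generation speed is bandwidth-bound: every decoded token must stream the active weights from HBM, so bandwidth divided by bytes moved per token gives an upper bound on single-stream TPS (the model size and bandwidth below are illustrative):

```python
def max_tokens_per_second(active_params_billion, bytes_per_param, hbm_bandwidth_gb_s):
    """Upper bound on single-stream decode speed: bandwidth / bytes read per token."""
    bytes_per_token = active_params_billion * 1e9 * bytes_per_param
    return hbm_bandwidth_gb_s * 1e9 / bytes_per_token

# Illustrative: a 7B dense model in FP16 on a GPU with ~3,350 GB/s of HBM bandwidth.
print(f"~{max_tokens_per_second(7, 2, 3350):.0f} tokens/s upper bound")  # ~239 tokens/s
```

Batching raises aggregate throughput because the same weight traffic serves many requests at once, which is why continuous batching matters below.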
Optimization
Continuous Batching and Quantization (INT8/FP8) are essential for production scale.