GPT-3 predicts a word, but what is happening inside? 175 billion parameters, 96 layers, a hidden dimension of 12288. Scaling laws say that loss falls smoothly and predictably as the model grows, yet some capabilities seem to appear abruptly at scale. Why? This chapter looks at the mathematical foundations of LLM training, inference, and scaling.
Intuitive Explanation
LLM = next-token prediction. That is the only training objective, but at sufficient scale abilities such as reasoning, translation, and coding emerge from it.
- Pre-training: learning to predict the next token on internet-scale text (self-supervised, no human labels).
- Fine-tuning: adapting the model to a specific task (supervised or RLHF).
- Inference: autoregressive generation, one token at a time.
Key insight: compression ≈ intelligence. Better next-token prediction implies a better model of the world that produced the text.
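The link between prediction and compression can be made concrete: a token that the model predicts with probability p can be encoded in about -log2(p) bits, so average cross-entropy is an average code length. The sketch below is only an illustration; the function name and the probabilities are made up for the example.

```python
import math

def bits_per_token(true_token_probs):
    """Average code length in bits when each token costs -log2(p)."""
    return sum(-math.log2(p) for p in true_token_probs) / len(true_token_probs)

# Hypothetical probabilities that two models assign to the same 4 true next tokens.
weak_model = [0.25, 0.25, 0.25, 0.25]
strong_model = [0.5, 0.5, 0.5, 0.5]

print(bits_per_token(weak_model))    # 2.0 bits per token
print(bits_per_token(strong_model))  # 1.0 bit per token
```

The model that predicts better needs fewer bits to describe the same text, which is the sense in which lower loss means better compression.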
Causal (Autoregressive) Attention
GPT is a decoder-only Transformer that uses causal masking:
Attention(Q, K, V) = softmax(QK^T / \sqrt{d_k} + M) V
The mask is M_{ij} = -\infty if j > i (future tokens are hidden) and 0 otherwise.
This ensures that position i attends only to positions ≤ i, so no information leaks from the future.
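To see the mask at work, the following minimal sketch applies a row-wise softmax to a small square score matrix, setting every entry with j > i to -inf first. It uses only the standard library, and `causal_softmax` is a name invented for this example.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax over a square score matrix with a causal mask.

    scores[i][j] is the raw score of query i against key j.
    Entries with j > i are set to -inf, so they get exactly zero weight.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        row_max = max(masked)  # always finite: the diagonal is never masked
        exps = [math.exp(s - row_max) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Uniform scores make the pattern easy to read: row i spreads its weight
# evenly over positions 0..i and puts nothing on later positions.
attn = causal_softmax([[0.0] * 4 for _ in range(4)])
for row in attn:
    print([round(w, 2) for w in row])
```

The printed matrix is lower triangular: the first row puts all of its weight on position 0, and the last row spreads it over all four positions.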
Scaling Laws
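Cross-entropy loss scales predictably, as a power law, with compute, parameters, and data:
L(N) = (N_c / N)^{\alpha_N}
N = number of model parameters, N_c = critical scale, α_N ≈ 0.076 (Kaplan et al., 2020, "Scaling Laws for Neural Language Models").
Similarly for dataset size D and compute C:
L(D) = (D_c / D)^{\alpha_D}, L(C) = (C_c / C)^{\alpha_C}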
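A quick numerical sketch shows what such a small exponent means in practice. It assumes the power-law form L(N) = (N_c / N)^{α_N} with α_N = 0.076 from the text. The value of `N_C` is an assumption chosen for illustration, and it does not affect the ratio computed at the end.

```python
ALPHA_N = 0.076   # exponent quoted in the text
N_C = 8.8e13      # assumed critical scale, for illustration only

def loss_from_params(n_params, n_c=N_C, alpha_n=ALPHA_N):
    """Power-law loss L(N) = (N_c / N) ** alpha_N."""
    return (n_c / n_params) ** alpha_n

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"N = {n:.0e}  L(N) = {loss_from_params(n):.3f}")

# The constant N_c cancels in a ratio: doubling N multiplies loss by 2 ** -alpha_N.
doubling_ratio = loss_from_params(2e9) / loss_from_params(1e9)
print(f"loss ratio after doubling N: {doubling_ratio:.4f}")
```

Under these assumptions each doubling of N lowers the loss by only about 5%, which is why large loss reductions require order-of-magnitude increases in scale.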
KV Cache & Inference Optimization
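In autoregressive generation, recomputing the keys and values of all earlier tokens at every step is slow: step t redoes the work for all t positions, so generating T tokens costs O(T^2) key/value projections in total.
Key trick: the KV cache. Store the K and V matrices from previous steps and compute only the new token's query, key, and value:
Attention(q_t, K_{\leq t}, V_{\leq t}) = softmax(q_t K_{\leq t}^T / \sqrt{d_k}) V_{\leq t}
K_{\leq t} = cached keys from positions 1 to t. The cache is append-only.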
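A rough way to compare the two regimes is to count how many per-token key/value projections are computed while generating T tokens. The sketch below counts only those projections. It is a simplified cost model, not a benchmark, and the function names are invented for the example.

```python
def kv_projections_without_cache(num_tokens):
    """Step t recomputes keys/values for all t positions seen so far."""
    return sum(t for t in range(1, num_tokens + 1))

def kv_projections_with_cache(num_tokens):
    """Step t computes keys/values only for the one new position."""
    return num_tokens

T = 1000
no_cache = kv_projections_without_cache(T)
with_cache = kv_projections_with_cache(T)
print(no_cache, with_cache, no_cache / with_cache)
```

For T = 1000 this gives 500500 projections without a cache against 1000 with one. Note that the attention scores at step t still touch all t cached keys, so the cache removes the recomputation, not the O(t) work per step.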
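- Memory: O(L \times d \times n_{layers} \times 2) (K and V for every layer, with sequence length L and model dimension d), which becomes a problem at long context.
- FlashAttention: tiling plus recomputation avoids materializing the full attention matrix; an IO-aware optimization that saves memory.
- Multi-Query Attention (MQA): all heads share a single K and V, which shrinks the KV cache by a factor equal to the number of heads (8× with 8 heads).
- Grouped-Query Attention (GQA): a compromise between MQA and multi-head attention, used in the larger Llama 2 models.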
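The memory estimate above can be turned into a small calculator. The sketch assumes the cache stores one key and one value vector of size `head_dim` per KV head, per layer, per token. The configuration (32 layers, 32 query heads of size 128, fp16, 4096 tokens) is hypothetical and not tied to any specific model.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed for K and V over all layers, for one sequence."""
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_value

# Hypothetical config: 32 layers, head size 128, fp16 (2 bytes), 4096 tokens.
cfg = dict(seq_len=4096, n_layers=32, head_dim=128)
mha = kv_cache_bytes(n_kv_heads=32, **cfg)  # one K/V head per query head
gqa = kv_cache_bytes(n_kv_heads=8, **cfg)   # 8 shared K/V groups
mqa = kv_cache_bytes(n_kv_heads=1, **cfg)   # a single shared K/V head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**20:.0f} MiB")
```

With these numbers the cache takes 2048 MiB for multi-head attention, 512 MiB for GQA with 8 groups, and 64 MiB for MQA. The saving is always the ratio of query heads to KV heads.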
RLHF: Reinforcement Learning from Human Feedback
Aligning the model with human preferences:
Reward Model
Trained on pairwise comparisons, where y_w is the preferred response and y_l the rejected one: P(y_w \succ y_l) = \sigma(r(x, y_w) - r(x, y_l))
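The pairwise model leads to a simple training loss: the negative log-probability that the preferred response wins. The sketch below works on plain scalar rewards to show the shape of the loss. The function names are illustrative, and a real reward model would produce r(x, y) with a neural network.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def preference_probability(r_w, r_l):
    """P(preferred beats rejected) = sigmoid(r_w - r_l)."""
    return sigmoid(r_w - r_l)

def pairwise_loss(r_w, r_l):
    """Negative log-likelihood of the observed human preference."""
    return -math.log(preference_probability(r_w, r_l))

print(preference_probability(1.0, 1.0))  # equal rewards give 0.5
print(pairwise_loss(2.0, 0.0))           # correct ranking: small loss
print(pairwise_loss(0.0, 2.0))           # wrong ranking: large loss
```

Only the difference between the two rewards matters, so the reward scale is determined up to an additive constant.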
PPO (Proximal Policy Optimization)
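The policy \pi_\theta is trained to maximize the learned reward minus a KL penalty:
\max_{\pi_\theta} E_{x, y \sim \pi_\theta}[r(x, y)] - \beta KL(\pi_\theta(y|x) || \pi_{ref}(y|x))
The KL penalty keeps the policy from drifting too far from the reference (SFT) model, which stabilizes training.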
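One common way to apply the KL penalty is to subtract a scaled log-probability ratio from the reward of each sampled response. The sketch below assumes that form. The coefficient `beta=0.1` and all numbers are hypothetical, and the log-ratio of a single sample is only an estimate of the KL divergence.

```python
def kl_shaped_reward(reward, logp_policy, logp_ref, beta=0.1):
    """Reward minus beta times the sampled log-ratio log pi(y|x) - log pi_ref(y|x)."""
    return reward - beta * (logp_policy - logp_ref)

# Hypothetical numbers: same reward-model score, different drift from the reference.
close = kl_shaped_reward(reward=1.0, logp_policy=-12.0, logp_ref=-12.5)
far = kl_shaped_reward(reward=1.0, logp_policy=-5.0, logp_ref=-25.0)
print(close, far)
```

A response that the policy finds far more likely than the reference model does is penalized heavily, even if the reward model scores it well, which is one guard against reward hacking.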
Modern alternative: DPO (Direct Preference Optimization) optimizes the policy directly on preference pairs without training a separate reward model. It is simpler and often competitive with PPO-based RLHF.
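The DPO loss for one preference pair can be sketched on scalar sequence log-probabilities: it is a logistic loss on how much more the policy favors the preferred response than the reference model does. The names and numbers below are illustrative, and `beta=0.1` is an assumed setting.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * margin), where margin compares policy and reference log-ratios."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy identical to the reference: margin 0, loss = log 2.
at_init = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# Policy raised the preferred response and lowered the rejected one: margin 4.
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
print(at_init, improved)
```

The reference log-probabilities play the role of the KL anchor, so no explicit reward model or sampling loop is needed during training.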
Python: Causal Mask & KV Cache
import torch
import torch.nn.functional as F


def causal_attention(Q, K, V, use_cache=False, past_kv=None):
    """Causal self-attention with optional KV cache.

    Q: (batch, heads, q_len, dim); K, V: (batch, heads, new_len, dim).
    """
    if use_cache and past_kv is not None:
        past_K, past_V = past_kv
        K = torch.cat([past_K, K], dim=2)  # append new keys
        V = torch.cat([past_V, V], dim=2)  # append new values
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (Q.size(-1) ** 0.5)
    # Causal mask: the q_len queries are the last q_len positions, so
    # query i may attend to keys 0 .. (k_len - q_len + i).
    q_len, k_len = scores.size(-2), scores.size(-1)
    mask = torch.triu(
        torch.ones(q_len, k_len, device=scores.device),
        diagonal=k_len - q_len + 1,
    ).bool()
    scores = scores.masked_fill(mask, float('-inf'))
    attn = F.softmax(scores, dim=-1)
    output = torch.matmul(attn, V)
    return output, ((K, V) if use_cache else None)


# Demo: one new query attends to all 10 available key/value positions
batch, heads, dim = 1, 8, 64
seq = 10
Q = torch.randn(batch, heads, 1, dim)  # generating 1 new token
K = torch.randn(batch, heads, seq, dim)
V = torch.randn(batch, heads, seq, dim)
out, cache = causal_attention(Q, K, V, use_cache=True)
print(f"Output shape: {out.shape}, Cache K shape: {cache[0].shape}")
Emergence & Capabilities
- In-context learning: given a few examples in the prompt, the model performs a new task with no weight update.
- Chain-of-thought: a prompt such as "let's think step by step" makes the model generate an explicit reasoning path, which often improves accuracy on multi-step problems.
- Multimodal: image patches and text tokens are embedded in the same space (examples: GPT-4V, LLaVA).
- Tool use: the model generates a call to an external function or API, reads the result, and continues, looping until done.
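The tool-use loop in the last bullet can be sketched with a scripted stand-in for the model. Everything here is hypothetical: the `CALL`/`FINAL`/`TOOL_RESULT` text protocol, the `scripted_model` function, and the single `add` tool exist only to show the control flow, not any real API.

```python
def scripted_model(transcript):
    """Stand-in for an LLM: asks for a tool once, then answers."""
    if not any(line.startswith("TOOL_RESULT") for line in transcript):
        return "CALL add 2 3"
    result = transcript[-1].split()[-1]
    return f"FINAL The sum is {result}."

# Hypothetical tool registry: tool name -> function over string arguments.
TOOLS = {"add": lambda a, b: str(int(a) + int(b))}

def run_agent(question, model, max_steps=5):
    """Generate, execute any requested tool, feed the result back, repeat."""
    transcript = [f"USER {question}"]
    for _ in range(max_steps):
        output = model(transcript)
        transcript.append(output)
        if output.startswith("FINAL"):
            return output[len("FINAL "):], transcript
        _, tool_name, *args = output.split()
        transcript.append(f"TOOL_RESULT {TOOLS[tool_name](*args)}")
    raise RuntimeError("model did not finish within max_steps")

answer, transcript = run_agent("What is 2 + 3?", scripted_model)
print(answer)
```

The `max_steps` bound is a design choice worth keeping in any real loop, since a model that never emits a final answer would otherwise run forever.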
Practice Tasks
- Run attention without the causal mask. How can you observe the leak of future information?
- With vs without a KV cache: compare the (theoretical) speed of generating 1000 tokens.
- According to the Chinchilla law, how many tokens should a 7B model be trained on? How would you determine this?
- PPO vs DPO: which is simpler? Which has the stronger theoretical foundation?
Interview Questions
- Decoder-only vs encoder-decoder: why is decoder-only preferred for LLMs?
- What does α (alpha) represent in a scaling law? Why is it small?
- How would you overcome the KV cache memory bottleneck?
- What is reward hacking in RLHF, and why does it happen?
- Why does in-context learning work without a weight update? What is the theory?
Summary
- LLM = scaled-up autoregressive Transformer trained with a next-token prediction objective.
- Causal attention = future tokens are masked, which enforces the autoregressive property.
- Scaling laws = loss decreases predictably with N, D, and C. Chinchilla: compute-optimal training scales N and D in proportion.
- KV cache = faster inference; FlashAttention, MQA, and GQA = memory optimizations.
- RLHF = alignment with human preferences; DPO = a simpler alternative to PPO plus a reward model.