Chapter 55: Math Behind LLMs

📖 A Short Story

GPT-3 is predicting a single word, but what is happening inside? 175 billion parameters, 96 layers, 12288 hidden dimensions. Scaling laws say that loss falls smoothly and predictably as the model grows, yet some capabilities seem to appear abruptly at scale. Why? In this chapter we look at the mathematical foundations of LLM training, inference, and scaling.

Intuitive Explanation

LLM = next token prediction: there is only this one objective, yet at scale reasoning, translation, coding, and more all emerge.

Pre-training: learning to predict the next token on internet-scale text (unsupervised, or more precisely self-supervised).
Fine-tuning: adapting to a specific task (supervised or RLHF).
Inference: autoregressive generation, one token at a time.

Key insight: Compression = Intelligence. Better next-token prediction requires a better model of the world.
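The link between prediction and compression can be made concrete with a small sketch. It assumes a hypothetical list of probabilities that a model assigned to the tokens that actually came next; the numbers are made up for illustration. The average negative log-probability is the cross-entropy loss, and dividing by ln 2 gives the bits per token that an ideal entropy coder driven by the same model would need.

```python
import math

def cross_entropy_nats(probs_of_true_tokens):
    """Average negative log-probability (in nats) of the tokens that occurred."""
    total = sum(-math.log(p) for p in probs_of_true_tokens)
    return total / len(probs_of_true_tokens)

# Hypothetical probabilities the model gave to each true next token.
probs = [0.5, 0.25, 0.125, 0.125]

loss = cross_entropy_nats(probs)      # training loss in nats per token
bits_per_token = loss / math.log(2)   # same quantity expressed in bits
perplexity = math.exp(loss)           # effective number of equally likely choices

print(f"loss = {loss:.4f} nats, {bits_per_token:.2f} bits/token, perplexity = {perplexity:.3f}")
```

A lower loss means fewer bits per token, which is the sense in which better prediction is better compression.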

Causal (Autoregressive) Attention

GPT = decoder-only Transformer, causal masking:

\text{Attention}(Q, K, V)_i = \sum_{j} \text{softmax}_j\left(\frac{Q_i K_j^T}{\sqrt{d_k}} + M_{ij}\right) V_j

Mask M_{ij} = -\infty if j > i (future tokens are hidden), 0 otherwise. The softmax is taken over j, so masked positions receive weight 0.

This ensures position i only attends to positions ≤ i, so no information leaks from the future.
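To see what the mask does to the attention weights, here is a minimal pure-Python sketch. The score matrix is hypothetical (all zeros, standing in for Q_i K_j^T / sqrt(d_k)), chosen so that the effect of M_{ij} is easy to read off.

```python
import math

def causal_softmax_weights(scores):
    """Row-wise softmax of scores[i][j] + M_ij, with M_ij = -inf for j > i."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Future positions (j > i) get -inf, so exp(-inf) = 0 after the softmax.
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        row_max = max(masked)  # subtract the max for numerical stability
        exps = [math.exp(s - row_max) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical raw scores for a 3-token sequence (all equal).
scores = [[0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0]]

weights = causal_softmax_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row sums to 1 and is zero to the right of the diagonal: token 1 sees only itself, token 2 splits its weight over two positions, and token 3 over three.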

Scaling Laws

Loss (cross-entropy) scales predictably with compute, params, and data:

L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}

N = number of model parameters, N_c = a fitted constant (the critical scale), α_N ≈ 0.076 (Kaplan et al., 2020, "Scaling Laws for Neural Language Models").

Similarly for dataset size D and compute C:

L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
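A quick worked example shows how small the exponent is in practice. Under the power law for L(N), the ratio L(kN) / L(N) equals k^(-α_N), so the constant N_c cancels. The sketch below assumes only the α_N value quoted above; the helper name `loss_ratio` is illustrative.

```python
def loss_ratio(scale_factor, alpha):
    """Ratio L(k * N) / L(N) under L(N) = (N_c / N) ** alpha; N_c cancels out."""
    return scale_factor ** (-alpha)

alpha_N = 0.076  # exponent quoted in the text

for k in (2, 10, 100):
    print(f"{k:>4}x parameters -> loss multiplied by {loss_ratio(k, alpha_N):.3f}")
```

Doubling the parameter count lowers the loss by only about 5%, which is why large gains require order-of-magnitude increases in scale.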

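✨ Tip

Chinchilla scaling law: for compute-optimal training, parameters and training tokens should scale in equal proportion, N ∝ D, whereas earlier scaling-law work had recommended growing the model much faster than the data. Chinchilla, a 70B model trained on 1.4T tokens, beats much larger models trained on less data, such as the 280B Gopher.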
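The 70B / 1.4T figures imply a tokens-per-parameter ratio that is often used as a rule of thumb. The sketch below derives it and also estimates training compute with the common approximation C ≈ 6·N·D, which is an assumption brought in here and not stated in the text; the function names are illustrative.

```python
def tokens_per_parameter(n_params, n_tokens):
    """Training tokens seen per model parameter."""
    return n_tokens / n_params

def approx_training_flops(n_params, n_tokens):
    """Rule-of-thumb estimate C ~ 6 * N * D (assumption, not from the text)."""
    return 6 * n_params * n_tokens

n_params = 70e9    # 70B parameters, from the tip above
n_tokens = 1.4e12  # 1.4T tokens, from the tip above

ratio = tokens_per_parameter(n_params, n_tokens)
flops = approx_training_flops(n_params, n_tokens)

print(f"tokens per parameter: {ratio:.0f}")
print(f"approximate training compute: {flops:.2e} FLOPs")
```

Because N ∝ D, the same ratio can be applied to other model sizes as a first estimate, keeping in mind that it is a heuristic and not an exact law.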

KV Cache & Inference Optimization

In autoregressive generation, recomputing the keys and values of all earlier tokens at every step is slow.

Key trick: the KV cache. We store the K and V matrices from previous steps and reuse them:

\text{Attention}_t = \text{softmax}\left(\frac{Q_t [K_{\leq t}]^T}{\sqrt{d_k}}\right) [V_{\leq t}]

K_{\leq t} = cached keys from positions 1 to t — append only!
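A rough count illustrates the saving, under the simplifying assumption that cost is measured only in per-token key/value projections (one layer, one sequence). The function names are illustrative.

```python
def kv_projections_without_cache(n_tokens):
    """Step t re-projects keys/values for all t positions seen so far."""
    return sum(range(1, n_tokens + 1))  # 1 + 2 + ... + n = n(n + 1) / 2

def kv_projections_with_cache(n_tokens):
    """Step t projects only the newest position and appends it to the cache."""
    return n_tokens

n = 100
no_cache = kv_projections_without_cache(n)
with_cache = kv_projections_with_cache(n)
print(f"{n} generated tokens: {no_cache} projections without cache, {with_cache} with cache")
```

The cache turns the quadratic projection cost into a linear one. The attention scores at step t still involve all t cached keys, so each step remains O(t) in that part.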

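Memory: O(L \times d \times n_{layers} \times 2) (K + V per layer), where L is the context length and d the hidden size; this becomes a problem for long contexts.
FlashAttention: tiling + recomputation avoid materializing the full attention matrix, an IO-aware optimization that saves memory.
Multi-Query Attention (MQA): all heads share a single K and V, which shrinks the KV cache by a factor equal to the number of heads (e.g. 8× with 8 heads).
Grouped-Query Attention (GQA): a compromise between MQA and multi-head attention, used in the larger Llama 2 models.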
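The memory formula can be turned into a back-of-the-envelope calculator. The sketch below reuses the 96 layers and hidden size 12288 from the opening story and assumes, for illustration only, 96 heads of dimension 128, a 2048-token context, 16-bit values, and 8 KV groups for GQA.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2, batch=1):
    """K and V caches: 2 tensors per layer, each of shape seq_len x n_kv_heads x head_dim."""
    return 2 * batch * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_value

GIB = 2 ** 30
seq_len, n_layers, n_heads, head_dim = 2048, 96, 96, 128  # assumed configuration

mha = kv_cache_bytes(seq_len, n_layers, n_kv_heads=n_heads, head_dim=head_dim)  # one K/V per head
gqa = kv_cache_bytes(seq_len, n_layers, n_kv_heads=8, head_dim=head_dim)        # 8 shared K/V groups
mqa = kv_cache_bytes(seq_len, n_layers, n_kv_heads=1, head_dim=head_dim)        # one K/V for all heads

for name, size in (("multi-head", mha), ("grouped-query", gqa), ("multi-query", mqa)):
    print(f"{name:>13}: {size / GIB:.3f} GiB per sequence")
```

The cache grows linearly with context length and batch size, and MQA shrinks it by exactly the number of heads, which is where figures such as 8× come from for 8-head models.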

RLHF: Reinforcement Learning from Human Feedback

Aligning the model with human preferences:

Reward Model

r_\theta(x, y) = \text{RewardModel}(\text{prompt } x, \text{ response } y)

Trained on pairwise comparisons, where y_w is the preferred response and y_l the rejected one: P(y_w \succ y_l) = \sigma(r(x, y_w) - r(x, y_l))
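The pairwise model above leads directly to a training loss: the negative log of the probability that the preferred response wins. A minimal sketch with hypothetical scalar rewards follows; in practice these would be outputs of the reward network.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def preference_probability(r_w, r_l):
    """P(y_w preferred over y_l) = sigmoid(r(x, y_w) - r(x, y_l))."""
    return sigmoid(r_w - r_l)

def pairwise_loss(r_w, r_l):
    """Negative log-likelihood of the human-preferred ordering."""
    return -math.log(preference_probability(r_w, r_l))

# Hypothetical reward scores for a preferred and a rejected response.
good_ranking = pairwise_loss(r_w=2.0, r_l=0.5)  # model agrees with the label
bad_ranking = pairwise_loss(r_w=0.5, r_l=2.0)   # model disagrees with the label

print(f"loss when ranking agrees: {good_ranking:.4f}")
print(f"loss when ranking disagrees: {bad_ranking:.4f}")
```

Only the difference between the two rewards matters, so the reward model is determined up to an additive constant per prompt.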

PPO (Proximal Policy Optimization)

\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)} \left[ r(x, y) \right] - \beta \, \mathrm{KL}\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right)

KL penalty = keeps the policy from drifting too far from the reference (SFT) model, which stabilizes training.
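One common way to apply the penalty is per sample: since y is drawn from the policy, log π_θ(y|x) − log π_ref(y|x) is a single-sample estimate of the KL term. The sketch below assumes this estimator and uses made-up numbers; `kl_penalized_reward` is an illustrative name.

```python
def kl_penalized_reward(reward, logp_policy, logp_ref, beta):
    """Single-sample estimate of r(x, y) - beta * KL(pi_theta || pi_ref)."""
    return reward - beta * (logp_policy - logp_ref)

# Hypothetical values: the policy has become much more confident than the
# reference model about this response (log-prob -2.0 versus -5.0).
shaped = kl_penalized_reward(reward=1.0, logp_policy=-2.0, logp_ref=-5.0, beta=0.1)
unchanged = kl_penalized_reward(reward=1.0, logp_policy=-5.0, logp_ref=-5.0, beta=0.1)

print(f"reward after penalty (policy drifted): {shaped:.2f}")
print(f"reward after penalty (policy equals reference): {unchanged:.2f}")
```

A larger β keeps the policy closer to the reference model at the cost of exploiting the reward less.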

Modern alternative: DPO (Direct Preference Optimization) optimizes the policy directly on preference pairs without training a separate reward model; it is simpler and often competitive with PPO-based RLHF.
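For comparison with the reward-model loss, here is a minimal sketch of the DPO loss for a single preference pair. It assumes the standard form, −log σ(β[(log π_θ(y_w|x) − log π_ref(y_w|x)) − (log π_θ(y_l|x) − log π_ref(y_l|x))]), and uses hypothetical sequence log-probabilities.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (preferred, rejected) pair of responses."""
    # Implicit reward margin: how much more the policy favors y_w over y_l
    # than the reference model does.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log(1.0 + math.exp(-margin))  # equals -log(sigmoid(margin))

# Hypothetical log-probabilities of the two responses.
loss_at_init = dpo_loss(-4.0, -6.0, ref_logp_w=-4.0, ref_logp_l=-6.0)   # policy == reference
loss_improved = dpo_loss(-3.0, -7.0, ref_logp_w=-4.0, ref_logp_l=-6.0)  # y_w up, y_l down

print(f"loss at initialization: {loss_at_init:.4f}")
print(f"loss after shifting toward y_w: {loss_improved:.4f}")
```

The log-probability ratios play the role of the reward, which is why no separate reward model is needed.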

Python: Causal Mask & KV Cache

Python · PyTorch

import torch
import torch.nn.functional as F

def causal_attention(Q, K, V, use_cache=False, past_kv=None):
    """Causal self-attention with optional KV cache.

    Q, K, V: (batch, heads, new_len, dim) projections of the new tokens only.
    past_kv: optional (past_K, past_V) holding earlier positions.
    """
    if use_cache and past_kv is not None:
        past_K, past_V = past_kv
        K = torch.cat([past_K, K], dim=2)  # append new keys along the sequence axis
        V = torch.cat([past_V, V], dim=2)  # append new values

    scores = torch.matmul(Q, K.transpose(-2, -1)) / (Q.size(-1) ** 0.5)

    # Causal mask: query row i sits at absolute position k_len - q_len + i
    # and must not attend to any later key position.
    q_len, k_len = scores.size(-2), scores.size(-1)
    mask = torch.triu(
        torch.ones(q_len, k_len, device=scores.device),
        diagonal=k_len - q_len + 1,
    ).bool()
    scores = scores.masked_fill(mask, float('-inf'))

    attn = F.softmax(scores, dim=-1)
    output = torch.matmul(attn, V)

    return output, ((K, V) if use_cache else None)

# Demo: prefill the first 9 tokens, then decode 1 new token using the cache
batch, heads, dim = 1, 8, 64
seq = 10
Q = torch.randn(batch, heads, seq, dim)
K = torch.randn(batch, heads, seq, dim)
V = torch.randn(batch, heads, seq, dim)

_, cache = causal_attention(Q[:, :, :-1], K[:, :, :-1], V[:, :, :-1], use_cache=True)
out, cache = causal_attention(Q[:, :, -1:], K[:, :, -1:], V[:, :, -1:],
                              use_cache=True, past_kv=cache)
print(f"Output shape: {out.shape}, Cache K shape: {cache[0].shape}")

Emergence & Capabilities

In-context Learning: given just a few examples in the prompt, the model picks up a new task without any weight update!
Chain-of-Thought: "let's think step by step" → the model generates a reasoning path, and accuracy on multi-step problems often jumps.
Multimodal: image patches + text tokens share the same embedding space (GPT-4V, LLaVA).
Tool Use: the model generates an external function/API call, reads the result, and continues; loop until done.

Practice Tasks

Run attention without the causal mask. How can you observe the future-information leak?
Without vs with KV cache: compare the speed of generating 1000 tokens (theoretically).
According to the Chinchilla law, how many tokens should you train a 7B model on? How would you determine this?
PPO vs DPO: which is simpler? Which has the stronger theoretical foundation?

Interview Questions

Decoder-only vs encoder-decoder: why is decoder-only preferred for LLMs?
What does α (alpha) represent in a scaling law? Why is it small?
How would you overcome the KV cache memory bottleneck?
What is reward hacking in RLHF, and why does it happen?
Why does in-context learning work without weight updates? What is the theory?

Summary

LLM = scaled-up autoregressive Transformer trained with a next-token prediction objective.
Causal attention = future tokens are masked, which ensures the autoregressive property.
Scaling laws = loss decreases predictably with N, D, C; Chinchilla = scaling N ∝ D is compute-optimal.
KV cache = faster inference; FlashAttention/MQA/GQA = memory optimization.
RLHF = alignment using human preferences; DPO = simpler alternative to PPO + reward model.
