অধ্যায় 39 — Embedding Space Intuition

📖 একটি ছোট গল্প

২০১৭ — Google-এর ৮ জন গবেষক একটি পেপার প্রকাশ করলেন: "Attention Is All You Need"। সেই একটি ধারণা — "প্রতিটি শব্দ বাকি সব শব্দের দিকে একসাথে তাকাবে" — RNN-কে অবসর দিয়ে GPT, BERT, LLaMA, Claude সবার জন্ম দিল। এটাই Transformer।

Scaled Dot-Product Attention

প্রতিটি token তিনটি vector তৈরি করে: Query, Key, Value।

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V

QKᵀ — similarity score
√dₖ দিয়ে scale → softmax saturation এড়ানো
softmax → attention weight
V-এর weighted sum = output

Multi-Head Attention

একটি attention এক pattern শেখে। h-টি head একসাথে different subspace দেখে:

MultiHead(Q,K,V) = Concat(head₁, …, headₕ) Wᴼ

headᵢ = Attention(QWᵢQ, KWᵢK, VWᵢⱽ)

Transformer Block

x ← x + MultiHeadAttention(LN(x))

x ← x + FFN(LN(x))

FFN = two-layer MLP (usually GELU)। Residual + LayerNorm → stable deep training।

Positional Encoding

Attention permutation-invariant — তাই position info দিতে হয়:

PE(pos, 2i) = sin(pos / 10000^(2i/d))

PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

আধুনিক LLM (LLaMA, GPT-NeoX) → RoPE (Rotary Position Embedding) ব্যবহার করে।

Python Implementation

pythonPython · NumPy

import torch, torch.nn as nn, torch.nn.functional as F, math

class SelfAttention(nn.Module):
    def __init__(self, d, n_heads):
        super().__init__()
        self.h, self.dk = n_heads, d // n_heads
        self.qkv = nn.Linear(d, 3*d)
        self.out = nn.Linear(d, d)

    def forward(self, x, mask=None):
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # split heads → (B, h, T, dk)
        q = q.view(B, T, self.h, self.dk).transpose(1, 2)
        k = k.view(B, T, self.h, self.dk).transpose(1, 2)
        v = v.view(B, T, self.h, self.dk).transpose(1, 2)

        scores = q @ k.transpose(-2, -1) / math.sqrt(self.dk)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn = F.softmax(scores, dim=-1)
        ctx = attn @ v                              # (B, h, T, dk)
        ctx = ctx.transpose(1, 2).reshape(B, T, D)
        return self.out(ctx)

x = torch.randn(2, 16, 64)        # batch=2, seq=16, d=64
sa = SelfAttention(64, n_heads=8)
print(sa(x).shape)                # → [2, 16, 64]

Why Transformers Scale

Fully parallel → GPU/TPU efficient
Global receptive field থেকে শুরু
Scaling laws: loss ∝ N⁻ᵅ · D⁻ᵝ · C⁻ᵞ (Chinchilla)
একই architecture → text, image (ViT), audio, protein সব ক্ষেত্রে SOTA

Practice

Causal mask (lower-triangular) যোগ করে GPT-style decoder বানান।
n_heads=1 বনাম n_heads=8 — attention pattern কীভাবে আলাদা?

Summary · সারসংক্ষেপ

Attention = soft, differentiable lookup — QKᵀ/√dₖ এর softmax × V।
Multi-head → multiple representation subspace।
Residual + LayerNorm + FFN = Transformer block।
Position info ছাড়া attention অর্থহীন → sinusoidal বা RoPE।
এই একটি architecture-ই আধুনিক generative AI-র ভিত্তি।

পূর্ববর্তী · CH 38

Transformer Mathematics

পরবর্তী · CH 40

Information Theory