CH 54 · Phase 8 · AI Math in Real-world Systems

Math Behind NLP

15–25 minutes · Math · Python
📖 A Short Story

How do we represent the meaning of the sentence "The cat sat on the mat" in numbers? The traditional approaches, bag of words and one-hot vectors, do not capture the similarity between "cat" and "feline". Word embeddings (word2vec, GloVe) fill this gap, and the math behind them has an unexpected beauty.

Intuitive Explanation

The core challenge of NLP is mapping discrete text to continuous vectors. Every word, sentence, and document must be represented in a vector space where semantic similarity corresponds to geometric proximity.

  • Word level: word2vec, GloVe, fastText (word → vector).
  • Sentence level: averaging, RNN, Transformer encoding (sentence → vector).
  • Contextual: ELMo, BERT, GPT (the same word in a different context gets a different vector).
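To make "semantic similarity = geometric proximity" concrete, here is a minimal sketch comparing one-hot vectors with dense embeddings. The dense vectors are made-up numbers chosen for illustration, not trained embeddings.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot vectors: every pair of distinct words is orthogonal.
vocab = ["cat", "feline", "mat"]
one_hot = np.eye(len(vocab))
sim_one_hot = cosine(one_hot[0], one_hot[1])  # "cat" vs "feline"

# Hypothetical dense embeddings (illustrative values only).
dense = {
    "cat": np.array([0.90, 0.80, 0.10]),
    "feline": np.array([0.85, 0.75, 0.20]),
    "mat": np.array([0.10, 0.20, 0.90]),
}
sim_cat_feline = cosine(dense["cat"], dense["feline"])
sim_cat_mat = cosine(dense["cat"], dense["mat"])

print(f"one-hot cat/feline: {sim_one_hot:.3f}")
print(f"dense   cat/feline: {sim_cat_feline:.3f}")
print(f"dense   cat/mat:    {sim_cat_mat:.3f}")
```

With one-hot vectors, "cat" is exactly as far from "feline" as from "mat". A dense space can place related words close together.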

Word2Vec Mathematics

Skip-gram Model

Given center word w_c, predict context words w_o:

P(w_o | w_c) = \frac{\exp(v_{w_o}^T v_{w_c})}{\sum_{w \in V} \exp(v_w^T v_{w_c})}

The softmax denominator sums over the entire vocabulary. |V| can reach millions of words, so computing it at every training step is slow.
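The skip-gram probability can be sketched in a few lines of NumPy. This sketch assumes two separate embedding tables (one for center words, one for context words), as word2vec does, even though the formula above writes both as v. The sizes and random vectors are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 4                      # toy vocabulary size and embedding dimension (assumed)
v_in = rng.normal(size=(V, d))   # center-word vectors
v_out = rng.normal(size=(V, d))  # context-word vectors

def skipgram_probs(center_id):
    """Return P(w_o | w_c) for every candidate context word w_o."""
    scores = v_out @ v_in[center_id]   # one dot product per vocabulary word
    scores = scores - scores.max()     # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

probs = skipgram_probs(center_id=1)
print(probs, probs.sum())
```

Every call touches all |V| output vectors, which is exactly the cost that negative sampling avoids.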

Negative Sampling (word2vec)

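Instead of the full softmax, train a logistic regression that separates the observed (positive) pair from K negative samples drawn from a noise distribution P_n(w). The objective to maximize is: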

L = \log \sigma(v_{w_o}^T v_{w_c}) + \sum_{i=1}^K \mathbb{E}_{w_i \sim P_n(w)} [\log(1 - \sigma(v_{w_i}^T v_{w_c}))]

Maximizing this objective amounts to an implicit factorization of the word-context PMI matrix (shifted by log K)!
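A minimal sketch of the negative-sampling objective for a single (center, context) pair follows. The expectation is approximated by one draw per negative sample, and the noise distribution is taken as unigram counts raised to the 0.75 power, as in the original word2vec. The counts, sizes, and word ids are made up for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_objective(v_c, v_o, v_neg):
    """Negative-sampling objective (to be maximized) for one pair.

    v_c: center vector (d,), v_o: context vector (d,), v_neg: negatives (K, d).
    """
    positive_term = np.log(sigmoid(v_o @ v_c))
    negative_term = np.sum(np.log(1.0 - sigmoid(v_neg @ v_c)))
    return float(positive_term + negative_term)

rng = np.random.default_rng(0)
V, d, K = 10, 8, 5                           # toy sizes (assumed)
emb_in = rng.normal(scale=0.1, size=(V, d))  # center-word table
emb_out = rng.normal(scale=0.1, size=(V, d)) # context-word table

counts = np.arange(1, V + 1, dtype=float)    # made-up unigram counts
P_n = counts ** 0.75
P_n /= P_n.sum()                             # noise distribution P_n(w)

center_id, context_id = 2, 7                 # hypothetical training pair
neg_ids = rng.choice(V, size=K, p=P_n)       # K negative samples
objective = sgns_objective(emb_in[center_id], emb_out[context_id], emb_out[neg_ids])
print(f"objective: {objective:.4f}")
```

Each update now touches K + 1 output vectors instead of |V|.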

GloVe: Global Vectors

Word2Vec learns from local context windows, while GloVe fits global co-occurrence statistics:

J = \sum_{i,j=1}^V f(X_{ij}) (w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij})^2

X_{ij} = number of times word j appears in the context of word i; f = weighting function that down-weights rare co-occurrences and caps the influence of very frequent ones.
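The GloVe objective can be evaluated directly, as in the sketch below. It assumes the weighting function from the GloVe paper, f(x) = (x / x_max)^α for x < x_max and 1 otherwise, with x_max = 100 and α = 0.75. The function names are chosen here for illustration.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(X_ij): grows with the count, capped at 1, and equal to 0 when X_ij = 0."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(X, W, W_tilde, b, b_tilde):
    """Weighted least-squares loss J over all pairs with X_ij > 0."""
    seen = X > 0
    log_X = np.log(np.where(seen, X, 1.0))   # log 1 = 0 as a placeholder for unseen pairs
    pred = W @ W_tilde.T + b[:, None] + b_tilde[None, :]
    err = pred - log_X
    return float(np.sum(glove_weight(X) * seen * err ** 2))

rng = np.random.default_rng(0)
V, d = 5, 3                                        # toy sizes (assumed)
X = rng.integers(0, 20, size=(V, V)).astype(float) # made-up co-occurrence counts
W, W_tilde = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, b_tilde = np.zeros(V), np.zeros(V)
loss = glove_loss(X, W, W_tilde, b, b_tilde)
print(f"J = {loss:.3f}")
```

Because f(0) = 0, pairs that never co-occur drop out of the sum, so the undefined log 0 never contributes.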

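💡 Insight
GloVe = global explicit matrix factorization, word2vec = local implicit factorization. Levy & Goldberg (2014) showed that skip-gram with negative sampling implicitly factorizes a shifted PMI matrix, which places it in the same family as GloVe's factorization of log co-occurrence counts.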

Attention in NLP

Self-attention (CH 39) revolutionized NLP:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V

Each word attends to all other words — global context in one layer!
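The attention formula translates almost line for line into NumPy. This is a single-sequence sketch without masking or batching; the input sizes are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for one sequence.

    Q: (n, d_k), K: (m, d_k), V: (m, d_v). Returns (output, weights).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n, m) similarity of each query to each key
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_k, d_v = 4, 8, 8                   # toy sizes (assumed)
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
out, weights = attention(Q, K, V)
print(out.shape, weights.sum(axis=-1))
```

The (n, n) weight matrix is where the O(n²) cost in sequence length comes from.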

Multi-head = h parallel attention mechanisms:

\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O

\text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)
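A sketch of multi-head self-attention (Q = K = V = X) under simplifying assumptions: one sequence, no masking, no biases, and random projection matrices standing in for learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """X: (n, d_model); W_q, W_k, W_v: (h, d_model, d_k); W_o: (h * d_k, d_model)."""
    h = W_q.shape[0]
    heads = [attention(X @ W_q[i], X @ W_k[i], X @ W_v[i]) for i in range(h)]
    return np.concatenate(heads, axis=-1) @ W_o   # Concat(head_1..head_h) W^O

rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4                 # toy sizes (assumed)
d_k = d_model // h
X = rng.normal(size=(n, d_model))
W_q = rng.normal(size=(h, d_model, d_k))
W_k = rng.normal(size=(h, d_model, d_k))
W_v = rng.normal(size=(h, d_model, d_k))
W_o = rng.normal(size=(h * d_k, d_model))
out = multi_head_attention(X, W_q, W_k, W_v, W_o)
print(out.shape)
```

Setting d_k = d_model / h keeps the total cost close to that of a single full-width head.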

BERT: Pre-training Math

Masked Language Model (MLM)

Mask random tokens and predict them from the remaining context:

L_{MLM} = -\mathbb{E}_{x \sim \mathcal{D}} \sum_{i \in \mathcal{M}} \log P(x_i | x_{\setminus \mathcal{M}})

\mathcal{M} = masked positions, x_{\setminus \mathcal{M}} = unmasked context.
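For a single sequence, the MLM loss is a cross-entropy summed over the masked positions only. The sketch below assumes a hypothetical model has already produced `logits` of shape (sequence length, |V|); random numbers stand in for that output.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mlm_loss(logits, target_ids, masked_positions):
    """-sum over i in M of log P(x_i | unmasked context), for one sequence.

    logits: (seq_len, V) scores from a hypothetical model,
    target_ids: (seq_len,) original token ids,
    masked_positions: integer array holding the positions in M.
    """
    log_probs = log_softmax(logits)
    picked = log_probs[masked_positions, target_ids[masked_positions]]
    return float(-picked.sum())

rng = np.random.default_rng(0)
seq_len, V = 8, 10                       # toy sizes (assumed)
logits = rng.normal(size=(seq_len, V))   # stand-in for model output
target_ids = rng.integers(0, V, size=seq_len)
masked_positions = np.array([1, 5])      # M: the positions that were masked
loss = mlm_loss(logits, target_ids, masked_positions)
print(f"L_MLM = {loss:.3f}")
```

Unmasked positions contribute nothing to the loss, even though the model produces logits for them.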

Next Sentence Prediction (NSP)

L_{NSP} = -\log P(y | [CLS])

y = binary label (does sentence B actually follow sentence A?).
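The NSP loss is an ordinary two-class cross-entropy on the [CLS] representation. In this sketch, `h_cls`, `W`, and `b` are hypothetical stand-ins for the encoder output and a linear classification head.

```python
import numpy as np

def nsp_loss(h_cls, W, b, y):
    """-log P(y | [CLS]) with a linear two-class head.

    h_cls: (d,) [CLS] vector, W: (2, d), b: (2,), y: 0 or 1.
    """
    logits = W @ h_cls + b
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[y])

rng = np.random.default_rng(0)
d = 16                                # hidden size (assumed)
h_cls = rng.normal(size=d)            # stand-in for the encoder's [CLS] output
W = rng.normal(scale=0.1, size=(2, d))
b = np.zeros(2)
loss_is_next = nsp_loss(h_cls, W, b, y=1)
print(f"L_NSP = {loss_is_next:.3f}")
```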

Modern variants: RoBERTa drops NSP, and ALBERT replaces it with sentence order prediction (SOP).

Positional Encoding

A Transformer has no inherent notion of sequence order, so positional information must be injected:

PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d})
PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d})

Properties: unique for each position, bounded in [-1, 1], and PE_{pos+k} is a linear function of PE_{pos}, so the dot product of two encodings depends only on their relative offset k.
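The sinusoidal encoding and its properties can be checked numerically. This sketch assumes an even model dimension; `max_len` and `d_model` are arbitrary toy values.

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """Sinusoidal positional encoding, shape (max_len, d_model); d_model must be even."""
    pos = np.arange(max_len)[:, None]         # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]      # (1, d_model / 2)
    angles = pos / (10000.0 ** (2 * i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)              # even dimensions
    pe[:, 1::2] = np.cos(angles)              # odd dimensions
    return pe

pe = sinusoidal_pe(max_len=50, d_model=16)

# Bounded: every entry lies in [-1, 1].
print(pe.min(), pe.max())

# Relative positions: the dot product depends only on the offset (here 4).
print(pe[3] @ pe[7], pe[10] @ pe[14])
```

The second check works because sin(a)sin(b) + cos(a)cos(b) = cos(a - b), so each frequency contributes a term that depends only on the offset.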

Modern alternatives: learned positional embeddings (BERT, GPT) or RoPE (rotary embeddings, used in Llama).

Python: Word Embedding from PMI

Python · NumPy
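import numpy as np

# Build a co-occurrence matrix (tiny example)
words = ["the", "cat", "sat", "on", "mat", "dog"]
vocab = {w: i for i, w in enumerate(words)}

# Co-occurring word pairs; each pair is counted in both directions below
cooccur = np.zeros((len(words), len(words)))
windows = [
    ("the", "cat"), ("cat", "sat"), ("sat", "on"), ("on", "mat"),
    ("the", "dog"), ("dog", "sat"),
]
for w1, w2 in windows:
    cooccur[vocab[w1], vocab[w2]] += 1
    cooccur[vocab[w2], vocab[w1]] += 1

# PMI matrix: log( P(w, c) / (P(w) P(c)) )
P = cooccur / cooccur.sum()
P_w = P.sum(axis=1, keepdims=True)  # word marginals, shape (V, 1)
P_c = P.sum(axis=0, keepdims=True)  # context marginals, shape (1, V)

# Unseen pairs have P = 0, so their PMI is -inf. Instead of adding an epsilon
# (which yields huge negative values that dominate the SVD), compute the log
# only for seen pairs and clip at 0: this is positive PMI (PPMI).
PMI = np.zeros_like(P)
seen = P > 0
PMI[seen] = np.log(P[seen] / (P_w @ P_c)[seen])
PPMI = np.maximum(PMI, 0.0)

# Truncated SVD of the PPMI matrix gives dense embeddings
# (an explicit factorization, in the same spirit as GloVe)
U, S, Vt = np.linalg.svd(PPMI)
k = 2
embeddings = U[:, :k] * np.sqrt(S[:k])

# "cat" and "dog" similarity: both occur with the same contexts ("the", "sat"),
# so their cosine similarity should be close to 1
cat_vec = embeddings[vocab["cat"]]
dog_vec = embeddings[vocab["dog"]]
sim = np.dot(cat_vec, dog_vec) / (np.linalg.norm(cat_vec) * np.linalg.norm(dog_vec))
print(f"Cat-Dog cosine similarity: {sim:.3f}")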

Modern NLP Architecture Math

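  • Transformer-XL: segment-level recurrence caches hidden states from earlier segments, extending the usable context without paying full O(n²) attention over the whole sequence.
  • XLNet: permutation language modeling, which combines autoregressive training with bidirectional context.
  • T5: a unified text-to-text framework that handles every task with an encoder-decoder.
  • Subword Tokenization: BPE and WordPiece keep the vocabulary small (|V| stays manageable) while handling unknown words by splitting them into known subword units.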
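The subword idea in the last bullet can be sketched with greedy longest-match-first tokenization, the strategy WordPiece uses at inference time. The vocabulary below is hypothetical and tiny. A real vocabulary is learned from a corpus and would give different splits.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces.

    Pieces that do not start the word carry a "##" prefix, as in BERT.
    """
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:                 # try the longest remaining substring first
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:                  # no piece matches: the whole word is unknown
            return [unk]
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical vocabulary, for illustration only.
vocab = {"un", "##happi", "##ness", "##happy", "cat", "##s"}
tokens = wordpiece_tokenize("unhappiness", vocab)
print(tokens)
```

BPE differs mainly in how the vocabulary is built (repeatedly merging the most frequent adjacent pair) and in applying those learned merges at tokenization time.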

Practice Tasks

  1. What happens when K is increased in Word2Vec negative sampling? Consider the precision vs recall trade-off.
  2. Why 10000 in the positional encoding? How would you choose the frequency range?
  3. How is the MLM probability computed in BERT? Derive P(w_i | context).
  4. How would subword tokenization break up "unhappiness"? How do BPE and WordPiece differ?

Interview Questions

  1. Word2Vec skip-gram vs CBOW: which is better, and when?
  2. Self-attention has O(n²) complexity: what problems does this cause on long documents, and what are the solutions?
  3. Why does masked token prediction give BERT a bidirectional representation?
  4. GPT (decoder-only) vs BERT (encoder-only): how do their attention masks differ?
  5. Why do analogies in embedding space (king - man + woman ≈ queen) work?

Summary

  • NLP core = discrete text → continuous vectors (embeddings).
  • Word2Vec = local implicit factorization, GloVe = global explicit factorization; the two are closely related.
  • Self-attention = each token attends to all others; O(n²) cost, but powerful global context.
  • BERT = masked LM + next sentence prediction, giving bidirectional context.
  • Positional encoding injects sequence order into the otherwise order-agnostic Transformer.