"The cat sat on the mat" — এই sentence-এর meaning কীভাবে represent করবো সংখ্যায়? Traditional approach: bag of words, one-hot vector — কিন্তু "cat" ও "feline"-এর similarity capture করে না। Word embeddings (word2vec, GloVe) এই gap পূরণ করে — এবং তাদের পেছনের গণিত unexpected beauty আছে।
Intuitive Explanation
The core challenge of NLP: discrete text → continuous vectors. Every word, sentence, and document has to be represented in a vector space where semantic similarity = geometric proximity.
- Word level: word2vec, GloVe, fastText (word → vector).
- Sentence level: averaging, RNN, Transformer encoding (sentence → vector).
- Contextual: ELMo, BERT, GPT (same word, different context = different vector).
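To make "semantic similarity = geometric proximity" concrete, here is a minimal sketch comparing one-hot vectors with dense vectors. The dense values are hypothetical, hand-picked for illustration, and are not the output of any trained model.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

vocab = ["cat", "feline", "mat"]
one_hot = np.eye(len(vocab))  # row i is the one-hot vector of vocab[i]

# Hypothetical 3-d embeddings, hand-picked for illustration only.
dense = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "feline": np.array([0.8, 0.9, 0.2]),
    "mat": np.array([0.1, 0.2, 0.9]),
}

# Distinct one-hot vectors are always orthogonal, so every word pair looks unrelated.
one_hot_sim = cosine(one_hot[0], one_hot[1])
# Dense vectors can place related words close together.
dense_sim = cosine(dense["cat"], dense["feline"])
dense_far = cosine(dense["cat"], dense["mat"])
print(one_hot_sim, dense_sim, dense_far)
```

With one-hot vectors the similarity of any two different words is zero by construction, which is exactly the gap that learned embeddings are meant to close.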
Word2Vec Mathematics
Skip-gram Model
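Given a center word $w_c$, predict each context word $w_o$:
$$P(w_o \mid w_c) = \frac{\exp(u_{w_o}^\top v_{w_c})}{\sum_{w \in V} \exp(u_w^\top v_{w_c})}$$
Here $v_w$ is the center ("input") vector and $u_w$ the context ("output") vector of word $w$. The softmax denominator sums over the entire vocabulary; $|V|$ can be in the millions, which makes every update slow.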
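The softmax over the vocabulary described above can be sketched in a few lines of NumPy. The matrix names `V_in` and `U_out`, the toy sizes, and the random initialisation are assumptions made for illustration, not part of word2vec itself.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 6, 4                       # toy sizes, chosen for illustration
V_in = rng.normal(size=(vocab_size, dim))    # center-word ("input") vectors
U_out = rng.normal(size=(vocab_size, dim))   # context-word ("output") vectors

def skipgram_probs(center_id):
    """Return P(w_o | w_c) for every candidate context word w_o."""
    scores = U_out @ V_in[center_id]         # one dot product per vocabulary word
    scores = scores - scores.max()           # stabilise the exponentials
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()     # the sum runs over all |V| words

probs = skipgram_probs(center_id=1)
print(probs, probs.sum())
```

The normalising sum touches every row of `U_out`, so a single prediction costs O(|V| · d). That cost is what makes the full softmax impractical for large vocabularies.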
Negative Sampling (word2vec)
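Instead of the full softmax, train a logistic regression that separates one positive (observed) pair from $K$ negative samples drawn from a noise distribution $P_n$:
$$J = \log \sigma(u_{w_o}^\top v_{w_c}) + \sum_{k=1}^{K} \mathbb{E}_{w_k \sim P_n}\left[\log \sigma(-u_{w_k}^\top v_{w_c})\right]$$
where $v_{w_c}$ is the center-word vector, $u_w$ is a context-word vector, and $\sigma$ is the logistic sigmoid.
Maximizing this objective amounts to an implicit matrix factorization of the word-context PMI matrix (shifted by $\log K$).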
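As a rough illustration of the negative-sampling idea, the sketch below computes the loss for a single (center, context) pair. The function name `neg_sampling_loss` and the hand-built "good" and "bad" vectors are assumptions for the demo; in real training the negatives would be sampled from a noise distribution.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_c, u_o, U_neg):
    """Negated negative-sampling objective for one (center, context) pair.

    v_c:   center vector, shape (d,)
    u_o:   true context vector, shape (d,)
    U_neg: K negative context vectors, shape (K, d)
    """
    pos = np.log(sigmoid(u_o @ v_c))             # pull the observed pair together
    neg = np.log(sigmoid(-(U_neg @ v_c))).sum()  # push the K noise pairs apart
    return float(-(pos + neg))

rng = np.random.default_rng(0)
d, K = 8, 5
v_c = rng.normal(size=d)

# Hand-built extremes: context aligned with the center and negatives opposed, then the reverse.
loss_good = neg_sampling_loss(v_c, u_o=v_c, U_neg=-np.tile(v_c, (K, 1)))
loss_bad = neg_sampling_loss(v_c, u_o=-v_c, U_neg=np.tile(v_c, (K, 1)))
print(loss_good, loss_bad)
```

Each pair now costs K + 1 dot products instead of |V|, which is the whole point of replacing the softmax.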
GloVe: Global Vectors
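Word2Vec is local (window-based); GloVe is global (it fits co-occurrence statistics of the whole corpus):
$$J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left(w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2$$
$X_{ij}$ = number of times word $j$ appears in the context of word $i$; $w_i$ and $\tilde{w}_j$ are the word and context vectors, with biases $b_i$ and $\tilde{b}_j$; $f$ is a weighting function that down-weights rare co-occurrences.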
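A small sketch of a GloVe-style weighted least-squares loss follows. It assumes the weighting function from the GloVe paper, f(x) = (x / x_max)^α capped at 1 with x_max = 100 and α = 0.75; the toy count matrix and random parameters are invented for illustration.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x): grows as (x / x_max)^alpha and is capped at 1."""
    return np.minimum((x / x_max) ** alpha, 1.0)

def glove_loss(W, W_ctx, b, b_ctx, X):
    """Weighted squared error between w_i . w~_j + b_i + b~_j and log X_ij."""
    i, j = np.nonzero(X)                       # log X_ij is undefined for zero counts
    pred = np.sum(W[i] * W_ctx[j], axis=1) + b[i] + b_ctx[j]
    err = pred - np.log(X[i, j])
    return float(np.sum(glove_weight(X[i, j]) * err ** 2))

rng = np.random.default_rng(0)
X = np.array([[0, 4, 1], [4, 0, 2], [1, 2, 0]], dtype=float)  # toy counts
n, d = X.shape[0], 2
W, W_ctx = rng.normal(size=(n, d)), rng.normal(size=(n, d))
b, b_ctx = np.zeros(n), np.zeros(n)

loss = glove_loss(W, W_ctx, b, b_ctx, X)
print(loss)
```

Skipping the zero entries is what keeps the loss tractable: only observed co-occurrences contribute, and rare ones contribute with a small weight.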
Attention in NLP
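Self-attention (CH 39) revolutionized NLP:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
Each word attends to all other words, giving global context in a single layer.
Multi-head = $h$ parallel attention mechanisms:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W^O$$
$$\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$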
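The following NumPy sketch shows scaled dot-product attention and a multi-head wrapper that follows the head_i definition above. The shapes, the random projection matrices, and the helper names `attention` and `multi_head` are assumptions for illustration; a real Transformer layer would add masking, residual connections, and layer normalisation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention; returns (output, weights)."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # (n, n): every token vs every token
    return weights @ V, weights

def multi_head(X, W_q, W_k, W_v, W_o):
    """W_q, W_k, W_v are lists of h per-head projections; W_o mixes the heads."""
    heads = [attention(X @ wq, X @ wk, X @ wv)[0]
             for wq, wk, wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
n, d_model, h = 5, 8, 2                         # toy sizes
d_head = d_model // h
X = rng.normal(size=(n, d_model))
W_q = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
W_k = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
W_v = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
W_o = rng.normal(size=(h * d_head, d_model))

_, weights = attention(X, X, X)
out = multi_head(X, W_q, W_k, W_v, W_o)
print(weights.shape, out.shape)
```

The (n, n) weight matrix is where the O(n²) cost comes from: each row is a probability distribution over all tokens in the sequence.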
BERT: Pre-training Math
Masked Language Model (MLM)
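Mask random tokens and predict them from the rest:
$$\mathcal{L}_{\mathrm{MLM}} = -\sum_{i \in \mathcal{M}} \log P(x_i \mid x_{\setminus \mathcal{M}})$$
$\mathcal{M}$ = masked positions, $x_{\setminus \mathcal{M}}$ = unmasked context.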
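A minimal sketch of the masked-LM loss follows: the negative log-likelihood of the original tokens, summed over the masked positions only. The `logits` array stands in for the output of a hypothetical encoder; no real BERT model is involved, and the toy sizes are assumptions.

```python
import numpy as np

def mlm_loss(logits, token_ids, masked_positions):
    """Summed negative log-likelihood of the original tokens at masked positions.

    logits:           (seq_len, vocab_size) scores from a hypothetical encoder
    token_ids:        (seq_len,) original token ids before masking
    masked_positions: indices of the masked positions
    """
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))  # log-softmax
    picked = log_probs[masked_positions, token_ids[masked_positions]]
    return float(-picked.sum())

rng = np.random.default_rng(0)
seq_len, vocab_size = 6, 10
token_ids = rng.integers(0, vocab_size, size=seq_len)
masked_positions = np.array([1, 4])

# A model that knows nothing: uniform distribution over the vocabulary.
uniform_logits = np.zeros((seq_len, vocab_size))
loss_uniform = mlm_loss(uniform_logits, token_ids, masked_positions)

# A model that strongly favours the correct token at each masked position.
confident_logits = np.zeros((seq_len, vocab_size))
confident_logits[masked_positions, token_ids[masked_positions]] = 10.0
loss_confident = mlm_loss(confident_logits, token_ids, masked_positions)
print(loss_uniform, loss_confident)
```

Unmasked positions never enter the loss; they only serve as context for the predictions at the masked ones.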
Next Sentence Prediction (NSP)
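$$\mathcal{L}_{\mathrm{NSP}} = -\left[y \log \hat{y} + (1 - y) \log(1 - \hat{y})\right]$$
$y$ = binary label (does sentence B actually follow sentence A?), $\hat{y}$ = the model's predicted probability that it does.
Modern variants: RoBERTa drops NSP, and ALBERT uses sentence order prediction (SOP) instead.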
Positional Encoding
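A Transformer has no inherent notion of sequence order, so positional information must be injected:
$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right)$$
Properties: unique for each position, bounded in $[-1, 1]$, and linearly related across relative positions ($PE_{pos+k}$ is a linear function of $PE_{pos}$, and the dot product $PE_{pos} \cdot PE_{pos+k}$ depends only on the offset $k$).
Modern: learned positional embeddings (BERT, GPT) or RoPE (rotary, used in Llama).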
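The sinusoidal scheme and the properties listed in this section can be checked numerically. The sketch below assumes the standard sine/cosine construction with base 10000 and an even `d_model`; the function name `sinusoidal_pe` is chosen for illustration.

```python
import numpy as np

def sinusoidal_pe(max_len, d_model, base=10000.0):
    """PE[pos, 2i] = sin(pos / base^(2i/d_model)); PE[pos, 2i+1] = cos of the same angle."""
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]    # even dimension indices 2i
    angles = pos / base ** (two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(max_len=50, d_model=16)

# Same offset (5) at two different absolute positions gives the same dot product.
same_offset = bool(np.isclose(pe[3] @ pe[8], pe[20] @ pe[25]))
print(pe.shape, pe.min(), pe.max(), same_offset)
```

The base controls the range of wavelengths: the fastest dimension pair has angle `pos` itself, while the slowest varies on a scale close to the base, so a larger base stretches the longest wavelengths.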
Python: Word Embedding from PMI
import numpy as np

# Build a symmetric co-occurrence matrix (tiny example)
words = ["the", "cat", "sat", "on", "mat", "dog"]
vocab = {w: i for i, w in enumerate(words)}

# Co-occurring word pairs; each pair is counted in both directions below
windows = [
    ("the", "cat"), ("cat", "sat"), ("sat", "on"), ("on", "mat"),
    ("the", "dog"), ("dog", "sat"),
]
cooccur = np.zeros((len(words), len(words)))
for w1, w2 in windows:
    cooccur[vocab[w1], vocab[w2]] += 1
    cooccur[vocab[w2], vocab[w1]] += 1

# PMI matrix: log of P(w, c) / (P(w) * P(c)); the epsilon avoids log(0)
P = cooccur / cooccur.sum()
P_w = P.sum(axis=1, keepdims=True)
P_c = P.sum(axis=0, keepdims=True)
PMI = np.log((P + 1e-10) / (P_w @ P_c + 1e-10))
# Clip to positive PMI so unseen pairs contribute 0 instead of a large negative value
PPMI = np.maximum(PMI, 0.0)

# Truncated SVD of the PPMI matrix gives k-dimensional embeddings
# (an explicit count-based factorization, related in spirit to GloVe)
U, S, Vt = np.linalg.svd(PPMI)
k = 2
embeddings = U[:, :k] * np.sqrt(S[:k])

# Cosine similarity of "cat" and "dog", which share the contexts "the" and "sat"
cat_vec = embeddings[vocab["cat"]]
dog_vec = embeddings[vocab["dog"]]
sim = np.dot(cat_vec, dog_vec) / (np.linalg.norm(cat_vec) * np.linalg.norm(dog_vec))
print(f"Cat-Dog cosine similarity: {sim:.3f}")

Modern NLP Architecture Math
- Transformer-XL: segment-level recurrence for longer context (keeps the O(n²) attention cost manageable).
- XLNet: permutation language modeling, combining autoregressive and bidirectional modeling.
- T5: a unified text-to-text framework that handles every task with an encoder-decoder.
- Subword Tokenization: BPE and WordPiece keep the vocabulary small (small |V|) while still handling unknown tokens.
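To show how a subword vocabulary can be built, here is a simplified sketch of BPE merge learning: start from characters and repeatedly merge the most frequent adjacent pair. It omits end-of-word markers and other details of production tokenizers, and the word counts are invented for illustration.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from a {word: count} dict; returns the list of merged pairs."""
    # Start with every word split into single characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
        merges.append(best)
        merged_corpus = {}
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])  # fuse the pair into one symbol
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            merged_corpus[key] = merged_corpus.get(key, 0) + freq
        corpus = merged_corpus
    return merges

# Toy word counts, invented for illustration.
word_freqs = {"low": 5, "lower": 2, "lowest": 3, "lot": 1}
merges = learn_bpe(word_freqs, num_merges=2)
print(merges)
```

Frequent character sequences become single tokens, while a rare or unseen word can still be spelled out from smaller pieces, which is how subword methods avoid an unknown-token problem with a modest |V|.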
Practice Tasks
- What happens when K increases in Word2Vec negative sampling? Consider the precision vs recall trade-off.
- Why 10000 in the positional encoding? How would you choose the frequency range?
- How do you compute the MLM probability in BERT? Derive P(w_i | context).
- How would subword tokenization break up "unhappiness"? What is the difference between BPE and WordPiece?
Interview Questions
- Word2Vec skip-gram vs CBOW: which one is better, and when?
- Self-attention has O(n²) complexity: what problems does this cause for long documents, and what are the solutions?
- Why does masked token prediction give BERT a bidirectional representation?
- GPT (decoder-only) vs BERT (encoder-only): how do their attention masks differ?
- Why do analogies in embedding space (king - man + woman ≈ queen) work?
Summary
- NLP core = discrete text → continuous vectors (embeddings).
- Word2Vec = local, implicit factorization; GloVe = global, explicit factorization; the two are closely related rather than strictly equivalent.
- Self-attention = each token attends to all others; O(n²), but powerful global context.
- BERT = masked LM + next sentence prediction, bidirectional context.
- Positional encoding injects sequence order into the permutation-invariant Transformer.