Chapter 58: Understanding Mathematical Equations in Papers

📖 A Short Story

In one paper I came across \mathcal{L}_{\text{adv}} = \mathbb{E}_{x \sim p_{data}} [\log D(x)] + \mathbb{E}_{z \sim p_z} [\log(1 - D(G(z)))], and the author says "it follows from Jensen's inequality". But how? Understanding an equation in a paper comes down to three things: learning the notation, recognizing common patterns, and filling in the proof gaps yourself.

Common Notation Guide

The "alphabet soup" of AI papers: the same letter can carry different meanings.

L: loss function (or Lagrangian, or number of layers).
\mathcal{L}: calligraphic L, usually a loss or objective (e.g., ELBO, adversarial loss).
\ell: lowercase l, the per-sample loss (averaging it over samples gives L).
θ (theta): model parameters (weights).
φ (phi): variational parameters, encoder parameters, or hyperparameters.
ψ (psi): critic/discriminator parameters (GAN, RL).
ℙ / 𝔼: probability measure / expectation (over a distribution).
\sim: "distributed as" (e.g., z \sim \mathcal{N}(0, I)).
\propto: "proportional to" (the normalization constant is dropped).
\mathcal{O}(n): big-O upper bound on complexity; \Theta(n): tight bound.

Common Equation Patterns

Pattern 1: Expectation of Loss

\mathbb{E}_{x \sim p} [f(x)] = \int f(x) p(x) dx \approx \frac{1}{N} \sum_{i=1}^N f(x_i)

Monte Carlo estimate, with samples x_i drawn i.i.d. from p. This is the basis of empirical risk minimization.
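The approximation in Pattern 1 can be checked numerically. The sketch below is only an illustration: the choice p = N(0, 1) and f(x) = x^2 (exact expectation 1) and the helper names `mc_expectation`, `sample_standard_normal`, and `square` are assumptions made for this example, not something taken from a particular paper.

```python
import numpy as np

def mc_expectation(f, sampler, n, seed=0):
    """Estimate E_{x~p}[f(x)] by averaging f over n i.i.d. samples from p."""
    rng = np.random.default_rng(seed)
    x = sampler(rng, n)
    return float(np.mean(f(x)))

# Assumed example: p = N(0, 1), f(x) = x^2, so the exact value is Var(x) = 1.
def sample_standard_normal(rng, n):
    return rng.standard_normal(n)

def square(x):
    return x ** 2

for n in (100, 10_000, 1_000_000):
    est = mc_expectation(square, sample_standard_normal, n)
    print(f"N={n:>9,d}  estimate={est:.4f}  |error|={abs(est - 1.0):.4f}")
```

The error typically shrinks at a rate of about 1/sqrt(N), which is why a mini-batch average is a noisy but usable stand-in for the true expectation.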

Pattern 2: KL Divergence

KL(q \| p) = \mathbb{E}_q [\log q(x) - \log p(x)] = \mathbb{E}_q [\log \frac{q(x)}{p(x)}]

It shows up everywhere: ELBO, VAE, variational inference.
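To make the expectation form of the KL divergence concrete, here is a minimal sketch that compares a Monte Carlo estimate of \mathbb{E}_q[\log q(x) - \log p(x)] with the closed form for two univariate Gaussians. The particular means and standard deviations, and the function names, are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_logpdf(x, mean, std):
    """Log-density of N(mean, std^2) evaluated at x."""
    return -0.5 * np.log(2.0 * np.pi) - np.log(std) - 0.5 * ((x - mean) / std) ** 2

def kl_gaussians_closed_form(mq, sq, mp, sp):
    """KL(N(mq, sq^2) || N(mp, sp^2)) in closed form."""
    return np.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5

def kl_monte_carlo(mq, sq, mp, sp, n=200_000, seed=0):
    """Estimate KL(q || p) as the average of log q(x) - log p(x) with x ~ q."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mq, sq, size=n)
    return float(np.mean(gaussian_logpdf(x, mq, sq) - gaussian_logpdf(x, mp, sp)))

# Assumed example: q = N(0, 1), p = N(1, 2^2).
exact = kl_gaussians_closed_form(0.0, 1.0, 1.0, 2.0)
approx = kl_monte_carlo(0.0, 1.0, 1.0, 2.0)
reverse = kl_gaussians_closed_form(1.0, 2.0, 0.0, 1.0)  # KL(p || q)

print("KL(q||p) closed form:", exact)
print("KL(q||p) Monte Carlo:", approx)
print("KL(p||q) closed form:", reverse)
```

The last line is a reminder that KL is not symmetric: swapping the arguments changes the value, so it matters which distribution the expectation is taken over.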

Pattern 3: Chain Rule / Total Derivative

\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial h} \cdot \frac{\partial h}{\partial a} \cdot \frac{\partial a}{\partial W_1}

Backpropagation = repeated application of the chain rule.
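The chain-rule factorization can be traced on a tiny example. The sketch below assumes a specific toy model that is not in the text: a = W_1 x, h = tanh(a), and L = 0.5 * ||h - y||^2. It computes each factor in turn, tracks shapes, and compares the result against central differences.

```python
import numpy as np

def forward(W1, x, y):
    a = W1 @ x                          # pre-activation, shape (m,)
    h = np.tanh(a)                      # hidden activation, shape (m,)
    loss = 0.5 * np.sum((h - y) ** 2)   # scalar
    return a, h, loss

def backward(W1, x, y):
    _, h, _ = forward(W1, x, y)
    dL_dh = h - y                # dL/dh, shape (m,)
    dh_da = 1.0 - h ** 2         # diagonal of dh/da for tanh, shape (m,)
    dL_da = dL_dh * dh_da        # chain rule through the activation, shape (m,)
    return np.outer(dL_da, x)    # dL/dW1, shape (m, d), since a_i = sum_j W1[i, j] * x[j]

rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 2))
x = rng.standard_normal(2)
y = rng.standard_normal(3)

analytic = backward(W1, x, y)

# Central differences, one entry of W1 at a time.
eps = 1e-6
numeric = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        W_plus, W_minus = W1.copy(), W1.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        numeric[i, j] = (forward(W_plus, x, y)[2] - forward(W_minus, x, y)[2]) / (2 * eps)

max_diff = np.abs(analytic - numeric).max()
print("dL/dW1 shape:", analytic.shape)
print("max |analytic - numeric|:", max_diff)
```

Writing the shape next to each factor is a cheap way to catch a transposed matrix or a missing outer product when reading a paper's gradient expression.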

Pattern 4: Matrix Inner Product / Trace

\langle A, B \rangle = \text{tr}(A^T B) = \sum_{i,j} A_{ij} B_{ij}

Frobenius inner product, often written compactly as A:B.
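The three forms of the Frobenius inner product can be confirmed to agree on random matrices. This is a small sanity-check sketch; the matrix sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((3, 4))

via_trace = np.trace(A.T @ B)             # tr(A^T B)
via_sum = np.sum(A * B)                   # sum_{i,j} A_ij B_ij
via_einsum = np.einsum("ij,ij->", A, B)   # same sum in index notation

print(via_trace, via_sum, via_einsum)

# Special case B = A: the inner product is the squared Frobenius norm.
frob_sq = np.linalg.norm(A, "fro") ** 2
print(np.trace(A.T @ A), frob_sq)
```

The elementwise form costs one multiply per entry, whereas forming A^T B first computes a full matrix product only to keep its diagonal, so the sum form is usually the one to implement.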

Filling in Proof Gaps

Proofs in papers are often left as an "exercise for the reader". Here is how to handle them:

Identify the claim: what exactly does the theorem state? Is the converse also true?
Work backwards: start from the desired result and ask which assumptions are needed.
Test with numbers: verify with concrete values (e.g., d=2, n=3).
Find the missing lemma: it is often a standard result (e.g., "by Cauchy-Schwarz").
Check the appendix: the full proof is often in the appendix, with only a sketch in the main text.
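The "test with numbers" step can be as small as the sketch below. It assumes two standard inequalities as the claims under test: Jensen's inequality for the concave log, E[log X] <= log E[X], on a made-up distribution with n = 3 outcomes, and Cauchy-Schwarz on random vectors with d = 2. The specific values are illustrative only. Passing such a test does not prove anything, but a single counterexample shows that an assumption was misread.

```python
import numpy as np

# Jensen for the concave log: E[log X] <= log E[X], with n = 3 outcomes.
values = np.array([0.5, 2.0, 4.0])   # assumed positive outcomes
probs = np.array([0.2, 0.5, 0.3])    # assumed probabilities, sum to 1
lhs = float(np.sum(probs * np.log(values)))   # E[log X]
rhs = float(np.log(np.sum(probs * values)))   # log E[X]
print("E[log X] =", lhs, " log E[X] =", rhs, " holds:", lhs <= rhs)

# Cauchy-Schwarz: |<u, v>| <= ||u|| * ||v||, searched for violations with d = 2.
rng = np.random.default_rng(0)
violations = 0
for _ in range(1000):
    u, v = rng.standard_normal(2), rng.standard_normal(2)
    if abs(u @ v) > np.linalg.norm(u) * np.linalg.norm(v) + 1e-12:
        violations += 1
print("Cauchy-Schwarz violations found:", violations)
```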

💡 Insight

When you see "trivially" or "obviously", the step may be obvious to the author but not to you. Do not skip it without deriving it yourself.

Core Derivation Skills

Gradient check: verify the analytic gradient with central finite differences: \frac{\partial f}{\partial x_i} \approx \frac{f(x + \epsilon e_i) - f(x - \epsilon e_i)}{2\epsilon}.
Taylor expansion: approximation near a point: f(x + h) \approx f(x) + f'(x)h + \frac{1}{2}f''(x)h^2.
Log trick: product → sum: \log \prod_i p_i = \sum_i \log p_i.
Exponential trick: sum → max: \max_i x_i \le \log \sum_{i=1}^n e^{x_i} \le \max_i x_i + \log n, so log-sum-exp is a smooth approximation of the max.
Completing the square: used to simplify a Gaussian integral or a quadratic form.
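The exponential trick is worth checking by hand, because it appears in softmax, cross-entropy, and soft-max relaxations. The sketch below implements a numerically stable log-sum-exp (the helper name `logsumexp` and the test vectors are assumptions for this example), compares it with the max, and shows why the naive formula overflows.

```python
import numpy as np

def logsumexp(x):
    """Stable log(sum(exp(x))): subtract the max before exponentiating."""
    m = np.max(x)
    return float(m + np.log(np.sum(np.exp(x - m))))

x = np.array([1.0, 2.0, 5.0])
lse = logsumexp(x)
lower, upper = float(x.max()), float(x.max() + np.log(len(x)))
print(f"max = {lower:.4f} <= logsumexp = {lse:.4f} <= max + log n = {upper:.4f}")

# Scaling by a factor t and dividing back tightens the approximation to the max.
for t in (1.0, 10.0, 100.0):
    print(f"t={t:>5}: logsumexp(t*x)/t = {logsumexp(t * x) / t:.6f}")

# The naive formula overflows for large inputs; the stable one does not.
big = np.array([1000.0, 1001.0])
with np.errstate(over="ignore"):
    naive = float(np.log(np.sum(np.exp(big))))
stable = logsumexp(big)
print("naive:", naive, " stable:", stable)
```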

Reproducing Paper Results

Key checkpoints when going from math to code:

Build a hyperparameter table: learning rate, batch size, weight decay, schedule.
Check the initialization: Xavier, Kaiming, or custom?
Inspect the loss curve shape: does it show the expected behavior (monotonically decreasing? spikes?).
Track the gradient norm: are gradients exploding? What is the norm clipping threshold?
Re-run the ablations: do the results match the paper's table exactly?
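For the gradient-norm checkpoint, a framework-free sketch of global-norm tracking and clipping might look like the following. The function names `global_grad_norm` and `clip_by_global_norm`, the threshold of 1.0, and the toy gradients are all assumptions for illustration; the threshold to use should come from the paper being reproduced.

```python
import numpy as np

def global_grad_norm(grads):
    """L2 norm of all gradient arrays treated as one long vector."""
    return float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients by a common factor so the global norm is at most max_norm."""
    norm = global_grad_norm(grads)
    scale = min(1.0, max_norm / (norm + 1e-12))
    return [g * scale for g in grads], norm

# Toy gradients for two parameter tensors; global norm is sqrt(9 + 16 + 144) = 13.
grads = [np.array([3.0, 4.0]), np.array([[12.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = global_grad_norm(clipped)
print("norm before:", norm_before, " norm after:", norm_after)

# Gradients already below the threshold are left unchanged.
small = [np.array([0.1, 0.2])]
unchanged, _ = clip_by_global_norm(small, max_norm=1.0)
print("unchanged:", unchanged[0])
```

Logging `norm_before` at every step gives the curve to compare against the paper's reported training behavior; a sudden jump in it usually precedes a loss spike.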

Python: Numerical Gradient Check

Python · NumPy

import numpy as np

def numerical_gradient(f, x, eps=1e-5):
    """Finite-difference gradient check."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        idx = it.multi_index
        old = x[idx]
        
        x[idx] = old + eps
        f_plus = f(x)
        x[idx] = old - eps
        f_minus = f(x)
        
        grad[idx] = (f_plus - f_minus) / (2 * eps)
        x[idx] = old
        it.iternext()
    return grad

# Example: check gradient of f(x) = x^T A x
A = np.array([[2.0, 1.0], [1.0, 3.0]])
x = np.array([1.0, 2.0])

def f(x): return x.T @ A @ x
# Analytic gradient: (A + A^T) x = 2Ax (since A symmetric)
analytic = (A + A.T) @ x
numeric = numerical_gradient(f, x)
print("Analytic:", analytic)
print("Numeric: ", numeric)
print("Diff:    ", np.abs(analytic - numeric).max())
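A finite-difference check is not infallible. The sketch below illustrates two common failure modes under assumed toy functions: a poorly chosen step size (too large gives truncation error, too small gives floating-point cancellation) and a kink in the function, where the central difference averages the two one-sided slopes. The helper `central_diff` is defined here for the example.

```python
import numpy as np

def central_diff(f, x, eps):
    """Central finite-difference estimate of f'(x) for scalar x."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

# Failure mode 1: step size. f(x) = exp(x) has exact derivative exp(x).
x0 = 1.0
errors = {
    eps: abs(central_diff(np.exp, x0, eps) - np.exp(x0))
    for eps in (1e-1, 1e-5, 1e-12)
}
for eps, err in errors.items():
    print(f"eps={eps:.0e}  abs error={err:.2e}")

# Failure mode 2: a kink. ReLU has slope 0 on the left of 0 and 1 on the right.
def relu(x):
    return np.maximum(0.0, x)

kink_estimate = central_diff(relu, 0.0, 1e-5)
print("central difference of ReLU at 0:", kink_estimate)
```

At the kink the estimate is 0.5, which matches neither one-sided slope, so a mismatch there does not necessarily indicate a bug in the analytic gradient. Checking at a few random points away from kinks, and in float64 rather than float32, avoids most false alarms.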

Practice Tasks

Derive the attention mechanism equation (CH 39) by hand on paper, tracking matrix shapes.
Re-derive the ELBO (CH 50), starting from KL(q||p(z|x)).
Take the "proof sketch" from any paper and write out the full proof on paper.
Prove the O(n²d) attention complexity from Attention Is All You Need.
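As a starting point for the attention tasks, the sketch below implements single-head scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, with the shape of every intermediate written out, plus a simple multiplication count. It assumes Q, K, and V all have shape (n, d) with no batching, masking, or learned projections; `attention_mults` is a hypothetical counting helper that ignores the softmax cost.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, d) @ (d, n) -> (n, n)
    weights = softmax(scores, axis=-1)   # (n, n), each row sums to 1
    return weights @ V, weights          # (n, n) @ (n, d) -> (n, d)

def attention_mults(n, d):
    """Scalar multiplications in Q K^T plus weights @ V (softmax ignored)."""
    return n * n * d + n * n * d

n, d = 6, 4
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out, weights = attention(Q, K, V)

print("output shape:", out.shape, " weights shape:", weights.shape)
print("mults at n:", attention_mults(n, d), " at 2n:", attention_mults(2 * n, d))
```

Doubling n quadruples the count while doubling d only doubles it, which is the n²d scaling the last task asks you to prove.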

Interview Questions

You cannot understand an equation in a paper. What approach do you take?
Why is a gradient check important? When can it fail?
A paper claims "universal approximation". How would you verify the claim intuitively?
What should you understand when you see "WLOG" in a proof?

Summary

Notation guide = the first line of defense when reading AI papers: you need to know L, ℒ, θ, φ, ∼, ∝.
Common patterns: expectation, KL, chain rule, trace. You should recognize them on sight.
Proof gap = backwards reasoning + concrete numbers + appendix search + standard lemmas.
Gradient check = the safety net for your math; always verify with finite differences.
Reproduction = hyperparameters × initialization × loss shape × gradient norm × exact ablation.
