CH 56 · Phase 8 · AI Math in Real-world Systems

Math Behind Generative AI

15–25 minutes · Math · Python
📖 A Short Story

"A photograph of an astronaut riding a horse in space" — Stable Diffusion কীভাবে text থেকে image generate করে? ভেতরে কোনো artist নেই — শুধু noise-এর উপর gradual denoising process, guided by text embedding। Generative AI-এর পেছনে probability, differential equations, এবং information theory-এর beautiful combination আছে।

Intuitive Explanation

A generative model learns the data distribution p(x) and then generates new samples from it. There are three main families:

  • VAE: latent variable model with an approximate posterior (CH 50).
  • GAN: generator vs discriminator game with a min-max objective.
  • Diffusion: noise is gradually added and then removed along a Markov chain of latents (state-of-the-art for image generation).

Modern Generative AI is mostly Diffusion + Transformer (image, video, audio, 3D).

VAE for Generation

VAE = encoder q(z|x) + decoder p(x|z):

\log p(x) \geq \mathbb{E}_{q(z|x)}[\log p(x|z)] - KL(q(z|x) \| p(z)) = \text{ELBO}

Generation: sample z from the prior p(z) = N(0,I), then generate x with the decoder:

z \sim \mathcal{N}(0, I), \quad x = \text{Decoder}(z)

Limitations: blurry outputs (an MSE reconstruction loss averages over plausible details) and the limited expressiveness of the Gaussian prior.
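
The two ELBO terms above can be sketched numerically. This is a minimal NumPy illustration, not a trained VAE: it assumes a diagonal-Gaussian encoder q(z|x) = N(mu, diag(exp(logvar))), a unit-variance Gaussian decoder (so the reconstruction term reduces to a squared error up to a constant), and a hypothetical linear `decode` function standing in for a real decoder network.

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ), one value per example."""
    return 0.5 * np.sum(mu**2 + np.exp(logvar) - 1.0 - logvar, axis=-1)

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def elbo(x, mu, logvar, decode, rng):
    """One-sample ELBO estimate, assuming a unit-variance Gaussian decoder.

    log p(x|z) = -0.5 * ||x - decode(z)||^2 + const; the constant is dropped.
    """
    z = reparameterize(mu, logvar, rng)
    recon = -0.5 * np.sum((x - decode(z)) ** 2, axis=-1)
    return recon - kl_to_standard_normal(mu, logvar)

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 5))      # hypothetical decoder weights (illustration only)
decode = lambda z: z @ W             # toy linear decoder: R^2 -> R^5

x = rng.standard_normal((4, 5))      # batch of 4 toy data vectors
mu = np.zeros((4, 2))                # encoder outputs, set by hand for the demo
logvar = np.zeros((4, 2))

print(kl_to_standard_normal(mu, logvar))       # all zeros: q equals the prior here
print(elbo(x, mu, logvar, decode, rng).shape)  # one ELBO value per example

# Generation: sample z from the prior, then decode
z_new = rng.standard_normal((3, 2))
x_new = decode(z_new)
```

In a real VAE, `mu` and `logvar` come from an encoder network and the negative ELBO is minimized by gradient descent.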

GAN: Minimax Game

Generator G: noise z → fake image, Discriminator D: image → real/fake probability।

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]

Optimal D: D^*(x) = p_{data}(x) / (p_{data}(x) + p_G(x))

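At equilibrium: p_G = p_{data} and D^*(x) = 1/2. With D fixed at D^*, the objective equals 2 \, JSD(p_{data} \| p_G) - \log 4, so the generator minimizes the Jensen-Shannon divergence.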

Problems: mode collapse, training instability, and vanishing gradients when D becomes too strong.
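
The optimal-discriminator result can be checked on a small discrete example. The sketch below assumes two hand-picked four-outcome distributions (illustrative values, not from any dataset) and verifies that V(D*, G) = 2·JSD − log 4, and that D* = 1/2 everywhere once p_G matches p_data.

```python
import numpy as np

def optimal_discriminator(p_data, p_g):
    """D*(x) = p_data(x) / (p_data(x) + p_G(x)) for each outcome x."""
    return p_data / (p_data + p_g)

def value_function(d, p_data, p_g):
    """V(D, G) for discrete distributions, where d[i] = D(x_i)."""
    return np.sum(p_data * np.log(d)) + np.sum(p_g * np.log(1.0 - d))

def js_divergence(p, q):
    """Jensen-Shannon divergence between two strictly positive distributions."""
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy distributions over 4 outcomes (assumed values for illustration)
p_data = np.array([0.1, 0.2, 0.3, 0.4])
p_g = np.array([0.4, 0.3, 0.2, 0.1])

d_star = optimal_discriminator(p_data, p_g)
v_star = value_function(d_star, p_data, p_g)
identity_holds = np.isclose(v_star, 2.0 * js_divergence(p_data, p_g) - np.log(4.0))
print("V(D*, G) = 2*JSD - log 4:", identity_holds)

# Equilibrium: the generator matches the data distribution
d_eq = optimal_discriminator(p_data, p_data)
v_eq = value_function(d_eq, p_data, p_data)
print("D* at equilibrium:", d_eq)
print("V at equilibrium:", v_eq, "vs -log 4 =", -np.log(4.0))
```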

Diffusion: Forward Process

Forward = gradually add Gaussian noise (fixed, no learning):

q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t I)

After T steps: q(x_T) \approx \mathcal{N}(0, I) — pure noise!

Convenient closed form (reparameterization):

x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)

Here \alpha_t = 1 - \beta_t and \bar{\alpha}_t = \prod_{i=1}^t \alpha_i, which is set by the noise schedule \beta_t and decreases monotonically in t.
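
The closed form can be compared against the step-by-step Markov chain. The NumPy sketch below assumes a linear schedule from 1e-4 to 0.02 over 1000 steps, 0-based timestep indices, and toy "data" where every entry equals 2.0; the two routes to x_t should agree in mean and standard deviation up to sampling error.

```python
import numpy as np

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule and the cumulative product alpha_bar."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alpha_bar = np.cumprod(1.0 - betas)
    return betas, alpha_bar

def q_sample(x0, t, alpha_bar, rng):
    """Sample x_t directly from x_0 with the closed form (t is a 0-based index)."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

def q_sample_stepwise(x0, t, betas, rng):
    """Reach the same x_t by applying q(x_s | x_{s-1}) one step at a time."""
    x = x0.copy()
    for s in range(t + 1):
        noise = rng.standard_normal(x.shape)
        x = np.sqrt(1.0 - betas[s]) * x + np.sqrt(betas[s]) * noise
    return x

rng = np.random.default_rng(0)
betas, alpha_bar = make_schedule()
x0 = np.full(20000, 2.0)   # toy data: every entry is 2.0

t = 299
xt_closed, _ = q_sample(x0, t, alpha_bar, rng)
xt_steps = q_sample_stepwise(x0, t, betas, rng)

print(f"alpha_bar at first step: {alpha_bar[0]:.4f}, at last step: {alpha_bar[-1]:.2e}")
print(f"closed form:  mean={xt_closed.mean():.3f}, std={xt_closed.std():.3f}")
print(f"step by step: mean={xt_steps.mean():.3f}, std={xt_steps.std():.3f}")
```

The closed form is what makes training cheap: any x_t can be drawn in one shot instead of simulating t steps.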

Diffusion: Reverse Process

Reverse = learn to denoise (neural network):

p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))

Simplified objective (Ho et al. 2020): predict the noise ε:

L = \mathbb{E}_{x_0, \epsilon, t} \left[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right]

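At inference: sample noise x_T ~ N(0,I), then denoise iteratively for t = T, ..., 1, where z ~ N(0,I) (with z = 0 at the final step) and \sigma_t^2 = \beta_t is a common choice: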

x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(x_t, t)\right) + \sigma_t z
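
The update rule above can be exercised without training a network. The sketch below makes two assumptions that are not part of the text: the data distribution is a single point `target`, so the exact noise is known in closed form and `oracle_eps` can stand in for a trained \epsilon_\theta; and \sigma_t^2 = \beta_t. Under those assumptions the sampler should land on `target`.

```python
import numpy as np

def ddpm_sample(eps_model, betas, shape, rng):
    """Ancestral sampling with the update rule above, using sigma_t^2 = beta_t."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.standard_normal(shape)                # x_T ~ N(0, I)
    for t in range(len(betas) - 1, -1, -1):       # 0-based indices T-1, ..., 0
        eps = eps_model(x, t)
        coef = (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bar[t])
        mean = (x - coef * eps) / np.sqrt(alphas[t])
        # No noise is added at the final step
        z = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        x = mean + np.sqrt(betas[t]) * z
    return x

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)
target = np.array([1.5, -0.5, 3.0])   # toy "dataset": a single point

def oracle_eps(x_t, t):
    """Exact noise for point-mass data (stand-in for a trained network)."""
    return (x_t - np.sqrt(alpha_bar[t]) * target) / np.sqrt(1.0 - alpha_bar[t])

rng = np.random.default_rng(0)
sample = ddpm_sample(oracle_eps, betas, target.shape, rng)
print("sample:", sample)
print("target:", target)
```

With a learned \epsilon_\theta the loop is identical; only `eps_model` changes.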

Classifier-Free Guidance

Text-to-image: noise prediction conditioned on text prompt c:

\hat{\epsilon}_\theta(x_t, t, c) = \epsilon_\theta(x_t, t, \emptyset) + w \cdot (\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \emptyset))

w = guidance scale (commonly around 7-12 in Stable Diffusion):

  • w = 1: standard conditional generation (the unconditional terms cancel).
  • w > 1: amplifies the text influence; more "on-prompt" but less diverse.
  • w → ∞: the guidance term dominates the prediction, often producing artifacts (oversaturation).
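
The guidance formula is a simple extrapolation between two noise predictions. The sketch below uses hand-picked vectors in place of real network outputs (an assumption for illustration) to show how w moves the result from the unconditional prediction, through the conditional one, and beyond it.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, w):
    """Classifier-free guidance: eps_uncond + w * (eps_cond - eps_uncond)."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# Stand-ins for eps_theta(x_t, t, null) and eps_theta(x_t, t, c)
eps_uncond = np.array([0.0, 1.0, -1.0])
eps_cond = np.array([1.0, 1.0, 0.0])

for w in (0.0, 1.0, 7.5):
    print(f"w = {w}: {cfg_combine(eps_uncond, eps_cond, w)}")
```

In practice both predictions usually come from the same network, evaluated once with the prompt and once with an empty prompt, so each sampling step costs two forward passes.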

Flow Matching & Modern Alternatives

Diffusion can be viewed as a discretized stochastic differential equation (SDE).

Flow Matching (used in Rectified Flow and Stable Diffusion 3) follows a deterministic path instead:

\frac{dx_t}{dt} = v_\theta(x_t, t), \quad x_0 \sim p_{data}, \; x_1 \sim \mathcal{N}(0, I)

Learn a vector field v_\theta that transports the data distribution to noise (and back).

Advantages: straighter trajectories, so sampling often needs fewer steps (roughly 10-20 vs 50+) and is faster.

Consistency Models — directly predict x₀ from any xₜ, single-step generation!
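
A toy version of the flow-matching ODE shows why straight trajectories allow few steps. The sketch assumes the straight-line interpolation x_t = (1 - t) x_0 + t x_1 used in rectified flow, point-mass data at `target`, and an `oracle_velocity` that stands in for a trained v_\theta; all three are illustrative assumptions rather than details from the text.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point on the straight path from data x0 (t = 0) to noise x1 (t = 1)."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Regression target for v_theta on the straight path: d x_t / dt = x1 - x0."""
    return x1 - x0

def euler_sample(velocity, x1, num_steps):
    """Integrate dx/dt = v(x, t) backward from t = 1 (noise) to t = 0 (data)."""
    x = x1.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = 1.0 - k * dt
        x = x - dt * velocity(x, t)
    return x

target = np.array([1.5, -0.5, 3.0])   # toy "dataset": a single point

def oracle_velocity(x, t):
    """Exact velocity for point-mass data: noise - target = (x_t - target) / t."""
    return (x - target) / t

rng = np.random.default_rng(0)
noise = rng.standard_normal(target.shape)
one_step = euler_sample(oracle_velocity, noise, 1)
ten_steps = euler_sample(oracle_velocity, noise, 10)
print("1 step:  ", one_step)
print("10 steps:", ten_steps)
```

Because the path in this toy case is exactly straight, a single Euler step already reaches the data point; learned vector fields are only approximately straight, so real samplers still use several steps.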

Python: Simple Diffusion Training Loop

```python
import torch
import torch.nn as nn


class SimpleUNet(nn.Module):
    """Toy noise-prediction network (a small MLP, not a real U-Net)."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(784 + 1, 256),  # flattened image + scalar timestep
            nn.ReLU(),
            nn.Linear(256, 784),
        )

    def forward(self, x, t):
        # x: (B, 784), t: (B,) timestep scaled to [0, 1)
        inp = torch.cat([x, t.unsqueeze(1)], dim=1)
        return self.net(inp)


def diffusion_loss(model, x0, beta_schedule):
    """Simplified DDPM loss: MSE between the true and predicted noise."""
    batch = x0.size(0)
    num_steps = len(beta_schedule)
    # Sample one timestep index per example, uniformly at random
    t = torch.randint(0, num_steps, (batch,), device=x0.device)

    # alpha_bar_t = prod_{i <= t} (1 - beta_i)
    alpha_bar = torch.cumprod(1 - beta_schedule, dim=0)
    sqrt_ab = alpha_bar[t].sqrt().view(-1, 1)
    sqrt_one_minus_ab = (1 - alpha_bar[t]).sqrt().view(-1, 1)

    # Closed-form forward process: x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps
    noise = torch.randn_like(x0)
    xt = sqrt_ab * x0 + sqrt_one_minus_ab * noise

    # Scale t so the raw index (0..999) does not dominate the network input
    predicted_noise = model(xt, t.float() / num_steps)
    return nn.functional.mse_loss(predicted_noise, noise)


# Example (random tensors stand in for a batch of flattened 28x28 images)
beta = torch.linspace(1e-4, 0.02, 1000)
model = SimpleUNet()
x0 = torch.randn(32, 784)
loss = diffusion_loss(model, x0, beta)
print(f"Loss: {loss.item():.4f}")
```

Beyond Images: Video, 3D, Audio

  • Video Diffusion: temporal attention + 3D convolutions, with diffusion run in latent space (Stable Video Diffusion, SVD).
  • 3D Generation: NeRF + diffusion, voxel/plane representations (Gaussian Splatting, DreamFusion).
  • Audio/Speech: spectrogram diffusion (AudioLDM), or autoregressive models over raw waveforms (WaveNet) or codec tokens (VALL-E).
  • Multimodal: unified models across text/image/video (GPT-4o, Gemini, Flamingo).

Practice Tasks

  1. Linear vs cosine β schedule: compare the rate at which noise is added.
  2. Classifier-free guidance with w = 1, 5, 10: observe the diversity vs prompt-adherence trade-off.
  3. DDPM with 1000 steps vs DDIM with 50 steps: which is faster, and why?
  4. VAE reconstruction vs GAN generation: what is the sharpness vs diversity trade-off?
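
A possible starting point for the first task is sketched below. It assumes the linear schedule used elsewhere in this chapter (1e-4 to 0.02 over 1000 steps) and the cosine schedule as it is commonly defined, with a small offset s = 0.008 and betas clipped at 0.999; treat these constants as assumptions to vary rather than fixed requirements.

```python
import numpy as np

def linear_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """alpha_bar for a linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def cosine_alpha_bar(num_steps=1000, s=0.008):
    """alpha_bar for a cosine schedule.

    alpha_bar_t = f(t) / f(0), with f(t) = cos^2(((t/T + s) / (1 + s)) * pi / 2).
    Betas are recovered from consecutive ratios and clipped at 0.999.
    """
    t = np.arange(num_steps + 1) / num_steps
    f = np.cos((t + s) / (1.0 + s) * np.pi / 2.0) ** 2
    alpha_bar = f / f[0]
    betas = np.clip(1.0 - alpha_bar[1:] / alpha_bar[:-1], 0.0, 0.999)
    return np.cumprod(1.0 - betas)

linear = linear_alpha_bar()
cosine = cosine_alpha_bar()

# Fraction of signal variance that survives at a few timesteps
for step in (100, 250, 500, 750, 999):
    print(f"t={step:4d}  linear={linear[step]:.4f}  cosine={cosine[step]:.4f}")
```

Plotting both curves against t makes the difference in how quickly the signal is destroyed easier to see.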

Interview Questions

  1. Why is the diffusion forward process fixed, and why is the reverse process learned?
  2. Noise prediction vs score matching: how are they related?
  3. Why does diversity decrease as the guidance scale increases?
  4. What advantages does flow matching have over diffusion?
  5. Why is a VAE used in Stable Diffusion? (latent space diffusion)

Summary

  • Generative AI = learning the data distribution p(x) + generating samples from it.
  • VAE = latent variable model that optimizes the ELBO; GAN = minimax game; Diffusion = Markov chain that adds and removes noise.
  • Diffusion: the forward process is fixed (noise schedule), the reverse is learned (noise prediction); SOTA for images.
  • Classifier-free guidance amplifies conditioning; w > 1 is more prompt-faithful but less diverse.
  • Flow matching = deterministic ODE alternative with faster sampling; Consistency models = single-step generation.