"A photograph of an astronaut riding a horse in space" — Stable Diffusion কীভাবে text থেকে image generate করে? ভেতরে কোনো artist নেই — শুধু noise-এর উপর gradual denoising process, guided by text embedding। Generative AI-এর পেছনে probability, differential equations, এবং information theory-এর beautiful combination আছে।
Intuitive Explanation
A generative model learns the data distribution p(x) and then draws new samples from it. There are three main families:
- VAE: latent variable model with an approximate posterior (CH 50).
- GAN: generator vs. discriminator game with a min-max objective.
- Diffusion: noise is gradually added and then removed through a Markov chain of latents (state of the art for images).
Modern generative AI is mostly Diffusion + Transformer (image, video, audio, 3D).
VAE for Generation
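A VAE pairs an encoder q(z|x) with a decoder p(x|z) and is trained by maximizing the evidence lower bound (ELBO):
\mathcal{L}_{ELBO} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}(q(z|x) \,\|\, p(z))
Generation: sample z from the prior p(z) = N(0, I), then pass it through the decoder to produce x:
z \sim \mathcal{N}(0, I), \quad x \sim p(x|z)
Limitations: blurry outputs (a consequence of the MSE reconstruction loss) and the limited expressiveness of the Gaussian prior.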
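The regularizer in the VAE objective (ELBO) is the KL divergence between the encoder's output distribution and the N(0, I) prior, and for a diagonal Gaussian it has a closed form. The sketch below illustrates that term in plain Python. The function name `kl_to_standard_normal` and the (mean, log-variance) parameterization are assumptions made for this example, not something the chapter specifies.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv
        for m, lv in zip(mu, log_var)
    )


# An encoder output that already equals the prior costs nothing.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))
# Moving the mean away from zero is penalized.
print(kl_to_standard_normal([1.0, -1.0], [0.0, 0.0]))
# Shrinking the variance below 1 is penalized as well.
print(kl_to_standard_normal([0.0, 0.0], [-2.0, -2.0]))
```

This penalty is what keeps encoded latents close to the prior, so that a z drawn from N(0, I) at generation time lands in a region the decoder has seen.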
GAN: Minimax Game
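The generator G maps noise z to a fake image; the discriminator D maps an image to the probability that it is real. They play the minimax game:
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]
Optimal D for a fixed G: D^*(x) = p_{data}(x) / (p_{data}(x) + p_G(x))
At equilibrium p_G = p_{data}, so D^*(x) = 1/2 everywhere and the Jensen-Shannon divergence between p_{data} and p_G reaches its minimum of zero.
Problems: mode collapse, training instability, and vanishing gradients when D becomes too strong.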
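The link between the optimal discriminator and the Jensen-Shannon divergence can be checked numerically on a small discrete sample space. The sketch below uses made-up three-outcome distributions (hypothetical numbers) and the standard GAN value function V(D, G) = E_data[log D(x)] + E_G[log(1 - D(x))]. All function names are illustrative.

```python
import math


def optimal_discriminator(p_data, p_g):
    """D*(x) = p_data(x) / (p_data(x) + p_G(x)) for each outcome x."""
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]


def gan_value(p_data, p_g, d):
    """V(D, G) = E_data[log D(x)] + E_G[log(1 - D(x))] on a finite sample space."""
    return sum(
        pd * math.log(dx) + pg * math.log(1.0 - dx)
        for pd, pg, dx in zip(p_data, p_g, d)
    )


def js_divergence(p, q):
    """Jensen-Shannon divergence in nats (assumes strictly positive probabilities)."""
    def kl(u, v):
        return sum(a * math.log(a / b) for a, b in zip(u, v))

    m = [(a + b) / 2.0 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


p_data = [0.5, 0.3, 0.2]  # toy "real" distribution (hypothetical numbers)
p_g = [0.2, 0.3, 0.5]     # toy generator distribution (hypothetical numbers)

d_star = optimal_discriminator(p_data, p_g)
v_star = gan_value(p_data, p_g, d_star)

# With the optimal discriminator, V(D*, G) = -log 4 + 2 * JSD(p_data || p_G).
print("D* =", d_star)
print("V(D*, G) =", v_star)
print("-log 4 + 2 JSD =", -math.log(4.0) + 2.0 * js_divergence(p_data, p_g))
```

Setting `p_g = p_data` makes every entry of `d_star` equal to 0.5 and drives the value down to -log 4, which is the equilibrium described above.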
Diffusion: Forward Process
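The forward process gradually adds Gaussian noise. It is fixed, with nothing to learn:
q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I)
After T steps: q(x_T) \approx \mathcal{N}(0, I), which is pure noise.
Convenient closed form (via reparameterization), which samples x_t directly from x_0:
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)
\bar{\alpha}_t = \prod_{i=1}^t (1 - \beta_i) is the fraction of signal variance that survives to step t. It is determined by the noise schedule \beta_t and decreases monotonically in t.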
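To make the closed form concrete, the sketch below computes \bar{\alpha}_t for a linear schedule and draws x_t directly from x_0 using x_t = sqrt(\bar{\alpha}_t) x_0 + sqrt(1 - \bar{\alpha}_t) ε. It is plain Python on a single scalar "pixel". The helper names and the starting value x_0 = 2.0 are assumptions for illustration, and the β range (1e-4 to 0.02 over 1000 steps) matches the training example later in this chapter.

```python
import math
import random


def linear_betas(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced beta_1..beta_T."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alpha_bar(betas):
    """alpha_bar_t = prod_{i<=t} (1 - beta_i), returned for every t."""
    out, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        out.append(running)
    return out


def q_sample(x0, alpha_bar_t, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot using the closed form."""
    eps = rng.gauss(0.0, 1.0)
    return math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * eps


betas = linear_betas()
alpha_bar = cumulative_alpha_bar(betas)
rng = random.Random(0)

for t in (1, 250, 500, 1000):
    ab = alpha_bar[t - 1]  # list is 0-based, t is 1-based
    print(f"t={t:4d}  alpha_bar={ab:.5f}  x_t={q_sample(2.0, ab, rng):+.3f}")
```

As t grows, `alpha_bar` falls toward zero, so the x_0 term fades and x_t becomes indistinguishable from a standard normal draw.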
Diffusion: Reverse Process
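The reverse process learns to denoise with a neural network:
p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I)
Simplified objective (Ho et al. 2020): predict the noise \epsilon that was added:
L_{simple} = \mathbb{E}_{t, x_0, \epsilon}[\| \epsilon - \epsilon_\theta(x_t, t) \|^2]
At inference: sample x_T \sim \mathcal{N}(0, I), then iteratively denoise for t = T, ..., 1:
x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(x_t, t) \right) + \sigma_t z, \quad z \sim \mathcal{N}(0, I)
where \alpha_t = 1 - \beta_t, \sigma_t^2 = \beta_t is a common choice, and z = 0 at the final step.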
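The iterative denoising loop can be sketched without training anything if the data distribution is simple enough that the ideal noise prediction is known analytically. The example below assumes a hypothetical 1-D dataset x_0 ~ N(MU, SIGMA^2), for which E[ε | x_t] has a closed form. That function, `eps_exact`, stands in for a trained network ε_θ(x_t, t). The update used is the DDPM ancestral step x_{t-1} = (x_t - β_t / sqrt(1 - \bar{\alpha}_t) · ε) / sqrt(1 - β_t) + sqrt(β_t) · z, with σ_t^2 = β_t assumed.

```python
import math
import random

T = 1000
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]
alpha_bar, running = [], 1.0
for b in betas:
    running *= 1.0 - b
    alpha_bar.append(running)

MU, SIGMA = 2.0, 0.5  # hypothetical 1-D "dataset": x0 ~ N(MU, SIGMA^2)


def eps_exact(x_t, t):
    """E[eps | x_t] for Gaussian data; stands in for a trained eps_theta(x_t, t)."""
    ab = alpha_bar[t]
    var_t = ab * SIGMA**2 + (1.0 - ab)  # variance of the marginal q(x_t)
    return math.sqrt(1.0 - ab) * (x_t - math.sqrt(ab) * MU) / var_t


def ddpm_sample(rng):
    """Ancestral sampling: start from N(0, 1) and apply the reverse update T times."""
    x = rng.gauss(0.0, 1.0)
    for t in reversed(range(T)):  # 0-based index: T-1, ..., 0
        beta_t, ab_t = betas[t], alpha_bar[t]
        coef = beta_t / math.sqrt(1.0 - ab_t)
        mean = (x - coef * eps_exact(x, t)) / math.sqrt(1.0 - beta_t)
        noise = rng.gauss(0.0, 1.0) if t > 0 else 0.0  # no noise on the last step
        x = mean + math.sqrt(beta_t) * noise  # sigma_t^2 = beta_t
    return x


rng = random.Random(0)
samples = [ddpm_sample(rng) for _ in range(300)]
sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((s - sample_mean) ** 2 for s in samples) / len(samples))
print(f"mean={sample_mean:.3f} (target {MU}), std={sample_std:.3f} (target {SIGMA})")
```

Starting from pure noise, the samples end up distributed approximately like the data. With a real image model the only change is that `eps_exact` is replaced by the learned network.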
Classifier-Free Guidance
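Text-to-image: the noise prediction is conditioned on the text prompt c, and the sampler mixes it with an unconditional prediction (\varnothing denotes the empty prompt):
\hat{\epsilon} = \epsilon_\theta(x_t, \varnothing) + w \cdot (\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing))
w is the guidance scale (usually 7-12):
- w = 1: standard conditional generation (the unconditional terms cancel).
- w > 1: amplifies the influence of the text; samples are more "on-prompt" but less diverse.
- w → ∞: the guidance term dominates the prediction, which often produces artifacts such as oversaturation.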
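Classifier-free guidance is commonly written as ε̂ = ε_uncond + w · (ε_cond - ε_uncond), which agrees with the w = 1 case listed above. The sketch below applies that combination to two made-up noise predictions. The function name `cfg_combine` and the numbers are hypothetical; in a real sampler both predictions come from the same network, called once with the prompt and once with an empty prompt.

```python
def cfg_combine(eps_uncond, eps_cond, w):
    """Classifier-free guidance: eps_uncond + w * (eps_cond - eps_uncond), elementwise."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Hypothetical noise predictions for one latent with three entries.
eps_uncond = [1.0, 0.0, -2.0]
eps_cond = [3.0, 1.0, -2.0]

for w in (0.0, 1.0, 7.5):
    print(w, cfg_combine(eps_uncond, eps_cond, w))
```

With w = 0 the result is the unconditional prediction, with w = 1 it is the conditional one, and with w = 7.5 the output is extrapolated well past the conditional prediction along the direction in which the prompt changes it. Entries where both predictions agree (the third one here) are unaffected by w.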
Flow Matching & Modern Alternatives
Diffusion can be viewed as a discretized stochastic differential equation (SDE).
Flow Matching (Rectified Flow, used in Stable Diffusion 3) follows a deterministic path instead:
x_t = (1 - t)\, x_0 + t\, \epsilon, \quad \frac{dx_t}{dt} = v_\theta(x_t, t), \quad t \in [0, 1]
Learn a vector field v_\theta that transports the data distribution to noise (and back).
Advantages: fewer steps (roughly 10-20 vs. 50+), straighter trajectories, and faster sampling.
Consistency Models directly predict x_0 from any x_t, which enables single-step generation.
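Sampling from a flow-matching model amounts to integrating an ODE from noise back to data. The sketch below assumes the rectified-flow interpolation x_t = (1 - t) x_0 + t ε with t = 0 as data and t = 1 as noise, and a hypothetical 1-D data distribution N(MU, SIGMA^2) for which the ideal velocity field E[ε - x_0 | x_t] is available in closed form. That function, `velocity`, stands in for a trained v_θ. The integrator is plain Euler with 20 steps.

```python
import math

MU, SIGMA = 2.0, 0.5  # hypothetical 1-D data distribution: x0 ~ N(MU, SIGMA^2)


def velocity(x, t):
    """Exact marginal velocity E[eps - x0 | x_t = x] for x_t = (1 - t) x0 + t eps."""
    mean_t = (1.0 - t) * MU
    var_t = (1.0 - t) ** 2 * SIGMA**2 + t**2
    cov = t - (1.0 - t) * SIGMA**2  # Cov(eps - x0, x_t)
    return -MU + cov / var_t * (x - mean_t)


def euler_sample(x_noise, num_steps=20):
    """Integrate dx/dt = v(x, t) from t = 1 (noise) down to t = 0 (data)."""
    x, dt = x_noise, 1.0 / num_steps
    for k in range(num_steps):
        t = 1.0 - k * dt
        x = x - dt * velocity(x, t)  # step backwards in time
    return x


for z in (-1.0, 0.0, 1.0):
    exact = MU + SIGMA * z  # the exact ODE solution for this Gaussian toy case
    print(f"noise {z:+.1f} -> sample {euler_sample(z):.3f}   (exact map: {exact:.3f})")
```

There is no random noise injected during sampling, so the same starting point always maps to the same sample. The Euler result is close to the exact map after 20 steps but not identical, and raising `num_steps` narrows the gap.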
Python: Simple Diffusion Training Loop
import torch
import torch.nn as nn


class SimpleUNet(nn.Module):
    """Toy noise prediction network (a small MLP, not a real U-Net)."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(784 + 1, 256),  # flattened image + timestep
            nn.ReLU(),
            nn.Linear(256, 784),
        )

    def forward(self, x, t):
        # x: (B, 784), t: (B,)
        inp = torch.cat([x, t.unsqueeze(1)], dim=1)
        return self.net(inp)


def diffusion_loss(model, x0, beta_schedule):
    """Noise-prediction loss: MSE between the true and predicted noise."""
    batch = x0.size(0)
    num_steps = len(beta_schedule)
    # One random timestep index per example
    t = torch.randint(0, num_steps, (batch,))
    alpha_bar = torch.cumprod(1 - beta_schedule, dim=0)
    a_t = alpha_bar[t].sqrt().view(-1, 1)
    one_minus_a = (1 - alpha_bar[t]).sqrt().view(-1, 1)
    noise = torch.randn_like(x0)
    # Closed-form forward process: jump from x0 straight to x_t
    xt = a_t * x0 + one_minus_a * noise
    # Scale t to [0, 1) so it is comparable in magnitude to the pixel inputs
    predicted_noise = model(xt, t.float() / num_steps)
    return nn.functional.mse_loss(predicted_noise, noise)


# Example
beta = torch.linspace(1e-4, 0.02, 1000)
model = SimpleUNet()
x0 = torch.randn(32, 784)  # random stand-in for a batch of flattened 28x28 images
loss = diffusion_loss(model, x0, beta)
print(f"Loss: {loss.item():.4f}")
Beyond Images: Video, 3D, Audio
- Video Diffusion: temporal attention + 3D convolutions, with diffusion run in latent space (Stable Video Diffusion, SVD).
- 3D Generation: NeRF + diffusion, voxel/plane representations (Gaussian Splatting, DreamFusion).
- Audio/Speech: spectrogram diffusion (AudioLDM) or autoregressive models over raw waveforms (WaveNet) and neural codec tokens (VALL-E).
- Multimodal: unified models across text/image/video (GPT-4o, Gemini, Flamingo).
Practice Tasks
- Linear vs. cosine β schedule: compare how quickly each one adds noise.
- Classifier-free guidance with w = 1, 5, 10: observe the trade-off between diversity and prompt adherence.
- DDPM with 1000 steps vs. DDIM with 50 steps: which is faster, and why?
- VAE reconstruction vs. GAN generation: what is the sharpness vs. diversity trade-off?
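As a possible starting point for the first task, the sketch below tabulates \bar{\alpha}_t for a linear β schedule and for a cosine schedule. The cosine form used here, \bar{\alpha}_t = f(t)/f(0) with f(t) = cos^2(((t/T + s)/(1 + s)) · π/2) and s = 0.008, follows the definition proposed by Nichol and Dhariwal (2021). It omits the clipping of β_t that implementations usually add, and the function names are illustrative.

```python
import math


def linear_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_t for linearly spaced betas."""
    out, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + i * (beta_end - beta_start) / (num_steps - 1)
        running *= 1.0 - beta
        out.append(running)
    return out


def cosine_alpha_bar(num_steps=1000, s=0.008):
    """alpha_bar_t = f(t) / f(0) with f(t) = cos^2(((t/T + s) / (1 + s)) * pi/2)."""
    def f(t):
        return math.cos((t / num_steps + s) / (1.0 + s) * math.pi / 2.0) ** 2

    return [f(t) / f(0) for t in range(1, num_steps + 1)]


linear, cosine = linear_alpha_bar(), cosine_alpha_bar()
for t in (100, 250, 500, 750, 1000):
    print(f"t={t:4d}  linear={linear[t - 1]:.4f}  cosine={cosine[t - 1]:.4f}")
```

Reading the two columns side by side shows how much of the original signal each schedule retains at the same step, which is the quantity the task asks you to compare.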
Interview Questions
- Why is the diffusion forward process fixed, and why is the reverse process learned?
- Noise prediction vs. score matching: how are they related?
- Why does diversity decrease as the guidance scale increases?
- What advantages does flow matching have over diffusion?
- Why does Stable Diffusion use a VAE? (latent space diffusion)
Summary
- Generative AI = learning the data distribution p(x) + generating samples from it.
- VAE = latent variable model optimized with the ELBO; GAN = minimax game; Diffusion = Markov chain that adds and removes noise.
- Diffusion: the forward process is fixed (noise schedule), the reverse process is learned (noise prediction); SOTA for images.
- Classifier-free guidance amplifies the conditioning: w > 1 is more prompt-faithful but less diverse.
- Flow matching = deterministic ODE alternative with faster sampling; Consistency models = single-step generation.