Chapter 53: Math Behind Computer Vision

📖 A Short Story

You look at a single image and see a road, a car, a traffic light, and a pedestrian, all at once, overlapping and at different scales. How does a CNN separate and make sense of all this information? Convolution, pooling, receptive field: each of these mathematical operations has a geometric intuition behind it.

Intuitive Explanation

Computer vision means extracting meaning from pixels, which are just numbers. A raw image is a 3D tensor (H × W × C), where C = 3 for RGB or C = 1 for grayscale. Every vision task extracts meaning by applying operations to this tensor.

Classification: one label for the entire image (cat vs dog).
Detection: where each object is (bounding box + class).
Segmentation: a label for every pixel (which object each pixel belongs to).

Convolution Mathematics

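2D convolution (discrete): a kernel K of size (2a+1) × (2b+1) slides over the image I. As written below, with I(i+m, j+n), the kernel is not flipped, so this is strictly cross-correlation, which is what deep learning libraries compute under the name convolution: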

(I * K)(i,j) = \sum_{m=-a}^{a} \sum_{n=-b}^{b} I(i+m, j+n) \cdot K(m,n)

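The kernel weights are learned, so the same operation can become an edge detector, a color filter, a texture detector, and so on.

Output size: O = \lfloor (W - K + 2P)/S \rfloor + 1, where W is the input size, K is the kernel size (not the kernel itself), P is the padding, and S is the stride.

💡 Insight

With a 3×3 filter, stride 1, and padding 1, the output has the same size as the input (a common choice in modern CNNs).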
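
The output-size formula can be checked with a few lines of Python. This is a minimal sketch; the function name `conv_output_size` and the example shapes are illustrative assumptions, not taken from any library.

```python
def conv_output_size(w: int, k: int, p: int = 0, s: int = 1) -> int:
    """Output size along one dimension: floor((W - K + 2P) / S) + 1."""
    return (w - k + 2 * p) // s + 1

# 3x3 filter, stride 1, padding 1 keeps the size unchanged
print(conv_output_size(32, 3, p=1, s=1))   # 32
# No padding shrinks the output by K - 1
print(conv_output_size(10, 3, p=0, s=1))   # 8
# 7x7 filter, stride 2, padding 3 halves a 224-wide input
print(conv_output_size(224, 7, p=3, s=2))  # 112
```

Integer floor division `//` plays the role of the floor in the formula, which matters whenever the stride does not divide the numerator evenly.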

Receptive Field

How much of the original image does a single output neuron "see"?

RF_{\ell} = RF_{\ell-1} + (K_\ell - 1) \times \prod_{i=1}^{\ell-1} S_i

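Example: starting from RF_0 = 1, three stacked 3×3 conv layers (stride 1) give RF = 3, 5, 7, so the stack sees a 7×7 area (the same as one 7×7 conv).

The three 3×3 layers use 3 \times 9 = 27 weights, while a single 7×7 layer uses 49 (counted per input-output channel pair, ignoring biases): the same receptive field with fewer parameters.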
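
The recurrence above is easy to evaluate in code. The sketch below assumes RF_0 = 1 (a single input pixel) and describes each layer by a hypothetical `(kernel_size, stride)` pair; the helper name `receptive_field` is made up for illustration.

```python
def receptive_field(layers):
    """Receptive field of a stack of conv layers.

    layers: list of (kernel_size, stride) pairs, ordered from input to output.
    """
    rf = 1      # RF_0: one pixel of the input
    jump = 1    # product of the strides of all earlier layers
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

print(receptive_field([(3, 1)] * 3))   # 7: three stacked 3x3 convs
print(receptive_field([(7, 1)]))       # 7: one 7x7 conv
print(3 * 3 * 3, "vs", 7 * 7)          # 27 vs 49 weights per channel pair
```

Passing strides larger than 1 shows how quickly the receptive field grows once downsampling enters the stack.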

Backpropagation in CNNs

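Backprop through a convolution is itself a convolution. The gradient with respect to the input is a transposed convolution (often loosely called deconvolution), that is, a full convolution of the upstream gradient with the 180°-rotated kernel: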

\frac{\partial L}{\partial I} = \frac{\partial L}{\partial O} * K_{\text{rotated}}

Weight gradient:

\frac{\partial L}{\partial K} = I * \frac{\partial L}{\partial O}

In practice, TensorFlow and PyTorch compute the backward pass of conv2d automatically. The math still matters when reading papers, for example to understand why dilated convolution works.
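
The two gradient formulas can be sanity-checked numerically. The sketch below assumes a single-channel, stride-1, unpadded layer and a toy loss L = sum(O * G), where G stands in for the upstream gradient; the helper names `corr2d` and `numeric_grad` are illustrative only.

```python
import numpy as np

def corr2d(x, k):
    """Valid cross-correlation, stride 1, no padding."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def numeric_grad(f, x, eps=1e-6):
    """Central finite differences of the scalar f() with respect to x (perturbed in place)."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + eps
        f_plus = f()
        x[idx] = old - eps
        f_minus = f()
        x[idx] = old
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
I = rng.standard_normal((5, 5))   # input
K = rng.standard_normal((3, 3))   # kernel
G = rng.standard_normal((3, 3))   # upstream gradient dL/dO (assumed given)

def loss():
    # Toy scalar loss whose gradient with respect to O is exactly G
    return np.sum(corr2d(I, K) * G)

# Analytic gradients
dK = corr2d(I, G)                          # weight gradient: I correlated with dL/dO
dI = corr2d(np.pad(G, 2), K[::-1, ::-1])   # full convolution with the rotated kernel

print(np.allclose(dK, numeric_grad(loss, K), atol=1e-5))
print(np.allclose(dI, numeric_grad(loss, I), atol=1e-5))
```

Both checks are expected to print True. Padding G by K - 1 = 2 on each side is what turns the valid correlation into the "full" convolution that restores the 5×5 input shape.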

Modern Architecture Math

ResNet Skip: y = F(x) + x. Since \partial y / \partial x = \partial F / \partial x + I, the identity term gives the gradient a direct path, which mitigates vanishing gradients.
Inception: several filter sizes (1×1, 3×3, 5×5) run in parallel; 1×1 convs reduce the channel count to cut the cost.
Dilated Conv: (I *_d K)(i,j) = \sum_{m,n} I(i+d\cdot m, j+d\cdot n) K(m,n). It enlarges the RF without reducing resolution.
Attention (ViT): image patches are treated as a sequence, and self-attention models global relationships (CH 39).
Feature Pyramid: combines features from multiple scales so that object detection handles both small and large objects.
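
The dilated convolution formula from the list above can be sketched directly in NumPy. This is a minimal single-channel version with stride 1 and no padding; the function name `dilated_corr2d` is an assumption for illustration.

```python
import numpy as np

def dilated_corr2d(x, k, d=1):
    """Dilated cross-correlation: kernel taps are spaced d pixels apart."""
    kh, kw = k.shape
    eh, ew = d * (kh - 1) + 1, d * (kw - 1) + 1   # effective kernel extent
    oh, ow = x.shape[0] - eh + 1, x.shape[1] - ew + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Step d inside the window picks I(i + d*m, j + d*n)
            out[i, j] = np.sum(x[i:i + eh:d, j:j + ew:d] * k)
    return out

x = np.arange(81, dtype=float).reshape(9, 9)
k = np.ones((3, 3))

print(dilated_corr2d(x, k, d=1).shape)   # (7, 7): ordinary 3x3 window
print(dilated_corr2d(x, k, d=3).shape)   # (3, 3): the same 9 weights span a 7x7 window
```

The kernel still has 9 weights at d = 3, but each output covers a 7×7 region. With suitable padding the output keeps the input resolution, which is the property the list item refers to.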

Object Detection Math

Bounding Box Regression

Predicted box = anchor box + delta (relative offsets):

t_x = (x - x_a)/w_a, \quad t_w = \log(w / w_a)

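Width and height use a log because they scale multiplicatively; the center coordinates use a linear offset normalized by the anchor size. t_y and t_h are defined the same way with y, y_a, h, and h_a.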
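
A small encode/decode pair makes the parameterization concrete. The sketch assumes boxes are given as (center x, center y, width, height) and that t_y and t_h mirror t_x and t_w; the function names and the example numbers are illustrative.

```python
import numpy as np

def encode(box, anchor):
    """Box (x, y, w, h) -> regression targets (t_x, t_y, t_w, t_h) relative to an anchor."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return np.array([(x - xa) / wa, (y - ya) / ha, np.log(w / wa), np.log(h / ha)])

def decode(t, anchor):
    """Inverse of encode: targets + anchor -> box (x, y, w, h)."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return np.array([xa + tx * wa, ya + ty * ha, wa * np.exp(tw), ha * np.exp(th)])

anchor = (50.0, 50.0, 20.0, 40.0)   # hypothetical anchor box
box = (55.0, 48.0, 30.0, 40.0)      # hypothetical ground-truth box

t = encode(box, anchor)
print(t)                  # center offsets in anchor units, log scale factors
print(decode(t, anchor))  # recovers the original box
```

Because the width and height go through exp when decoding, any real-valued prediction maps to a positive size, which is one practical reason for the log form.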

IoU (Intersection over Union)

\text{IoU} = \frac{|B_{pred} \cap B_{gt}|}{|B_{pred} \cup B_{gt}|}

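A common labeling rule: IoU > 0.5 → positive, IoU < 0.3 → negative, and anchors in between are ignored during training because they are ambiguous. The exact thresholds vary between detectors.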
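
IoU for axis-aligned boxes takes only a few lines. The sketch assumes boxes in corner format (x1, y1, x2, y2) with x2 > x1 and y2 > y1; the function name `iou` is illustrative.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at 0 so non-overlapping boxes give zero intersection
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (0, 0, 2, 2)))   # 1.0: identical boxes
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))   # 1/7: overlap area 1, union 7
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))   # 0.0: disjoint boxes
```

The union is computed as area_a + area_b - inter so that the overlapping region is not counted twice.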

Python: Convolution from Scratch

Python · NumPy

import numpy as np

def conv2d(image, kernel, stride=1, pad=0):
    """Simple 2D convolution (cross-correlation: the kernel is not flipped)."""
    kH, kW = kernel.shape

    # Zero-pad every side by `pad` pixels
    if pad > 0:
        image = np.pad(image, pad, mode='constant')
    H, W = image.shape

    # Output size: floor((H - kH) / stride) + 1, with H already including padding
    outH = (H - kH) // stride + 1
    outW = (W - kW) // stride + 1
    output = np.zeros((outH, outW))

    for i in range(outH):
        for j in range(outW):
            # Window of the image currently under the kernel
            region = image[i*stride:i*stride+kH, j*stride:j*stride+kW]
            output[i, j] = np.sum(region * kernel)

    return output

# Edge detection kernel (Sobel)
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])
image = np.random.randn(10, 10)
edges = conv2d(image, sobel_x, pad=1)
print("Input shape:", image.shape, "→ Output shape:", edges.shape)

Vision Transformers (ViT)

Split the image into patches:

x_p = \text{Reshape}(x) \in \mathbb{R}^{N \times (P^2 \cdot C)}

P is the patch size (usually 16) and N = HW/P² is the sequence length.

Each patch is linearly embedded, a class token is prepended, and position embeddings are added:

z_0 = [x_{class}; x_p^1 E; x_p^2 E; \dots] + E_{pos}

A standard Transformer encoder (CH 39) then applies global attention across all patches.
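
The patch split and the construction of z_0 can be sketched in NumPy. The image size (32×32×3), patch size (8), and embedding width D = 64 are arbitrary choices for illustration, and the random matrices only stand in for parameters that a real model would learn.

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W, C) image into N = HW/p^2 flattened patches of length p*p*C."""
    h, w, c = img.shape
    assert h % p == 0 and w % p == 0, "H and W must be divisible by the patch size"
    x = img.reshape(h // p, p, w // p, p, c)
    x = x.transpose(0, 2, 1, 3, 4)        # (H/p, W/p, p, p, C)
    return x.reshape(-1, p * p * c)       # (N, p*p*C)

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32, 3))    # toy image
P, D = 8, 64                              # patch size, embedding width (assumed)

x_p = patchify(img, P)                                  # x_p in R^{N x (P^2 C)}
E = rng.standard_normal((P * P * 3, D)) * 0.02          # stand-in for the learned projection
x_class = np.zeros((1, D))                              # stand-in for the learned class token
E_pos = rng.standard_normal((x_p.shape[0] + 1, D)) * 0.02   # stand-in position embeddings

# z_0 = [x_class; x_p^1 E; x_p^2 E; ...] + E_pos
z0 = np.concatenate([x_class, x_p @ E], axis=0) + E_pos
print(x_p.shape, z0.shape)   # (16, 192) (17, 64)
```

The sequence has N + 1 rows because of the prepended class token, and each row of `x_p` is one P × P × C patch flattened in row-major order.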

✨ Tip

ViT lacks the inductive bias of a CNN (translation equivariance), so it needs a lot of data. Hybrid: CNN stem + Transformer = data efficient (CoaT, CvT).

Practice Tasks

3×3 conv, stride 2, padding 1: what is the output size for a 32×32 image?
How does the gradient vanish without ResNet skip connections? Show it with math.
By how much does the RF grow with a dilated conv at d=2 (for a 3×3 conv)?
In ViT, a 224×224 image with patch=16: what is the sequence length?

Interview Questions

Why is parameter sharing needed in a CNN? What is translation equivariance?
Two 3×3 convs vs one 5×5 conv: compare params and RF.
BatchNorm vs LayerNorm: which one is used in vision, and why?
Why does ViT perform worse than a CNN without a large dataset (ImageNet-21k)?
What is an anchor box in object detection, and why is it needed?

Summary

CV = mathematical operations (conv, pool, attention) applied to a pixel tensor.
Convolution = local weighted sum that extracts features with a learnable kernel.
Receptive field = the area of the original image that an output neuron sees; deeper layers have a larger RF.
ResNet skip = preserves gradient flow; dilated conv = larger RF without resolution loss.
ViT = image patches + Transformer; it gives up the inductive bias of a CNN but captures global context.
