আপনি একটি ছবিতে দেখছেন — road, car, traffic light, pedestrian — সব একসাথে, overlapping, different scales। CNN কীভাবে এই সমস্ত information আলাদা করে বোঝে? Convolution, pooling, receptive field —প্রতিটি mathematical operation-এর পেছনে geometric intuition আছে।
Intuitive Explanation
Computer vision = pixels (numbers) থেকে meaning বের করা। Raw image = 3D tensor (H × W × C) — C = 3 (RGB) বা 1 (grayscale)। সমস্ত vision task এই tensor-এর উপর operation চালিয়ে meaning extract করে।
- Classification — entire image-এর label (cat vs dog)।
- Detection — object কোথায় আছে (bounding box + class)।
- Segmentation — pixel-by-pixel label (each pixel = which object)।
Convolution Mathematics
2D convolution (discrete) — kernel K slide করে image I-এর উপর:
Kernel weight শিখে — edge detector, color filter, texture pattern, সব কিছু।
Output size: O = \lfloor (W - K + 2P)/S \rfloor + 1 (padding P, stride S)।
Receptive Field
কতটুকু original image area একটি output neuron "দেখে"?
Example: 3 stacked 3×3 conv layers (stride 1) → RF = 7×7 (same as one 7×7 conv)।
But 3×3 conv has 3 \times 9 = 27 params vs 7×7 has 49 params — parameter efficiency!
Backpropagation in CNNs
Convolution-এ backprop = transpose convolution (deconvolution):
Weight gradient:
Practical: TensorFlow/PyTorch-এ conv2d_backward automatically handle করে — কিন্তু paper পড়ার সময় "why dilated conv works" বোঝার জন্য math জানা দরকার।
Modern Architecture Math
- ResNet Skip — y = F(x) + x, gradient flow preserve করে (vanishing gradient solve)।
- Inception — multiple filter sizes (1×1, 3×3, 5×5) parallel, 1×1 conv channel reduce করে cost কমায়।
- Dilated Conv — (I *_d K)(i,j) = \sum I(i+d\cdot m, j+d\cdot n) K(m,n), RF বাড়ায় resolution কমায় না।
- Attention (ViT) — image patches = sequence, self-attention দিয়ে global relationship (CH 39)।
- Feature Pyramid — multi-scale features concatenate, object detection-এ small + large object handle করে।
Object Detection Math
Bounding Box Regression
Predicted box = anchor box + delta (relative offsets):
Log for width/height — multiplicative scaling-এর জন্য, center coordinate linear।
IoU (Intersection over Union)
IoU > 0.5 → positive, < 0.3 → negative, in-between → ignore (hard negatives)।
Python: Convolution from Scratch
import numpy as np
def conv2d(input, kernel, stride=1, pad=0):
"""Simple 2D convolution."""
H, W = input.shape
kH, kW = kernel.shape
# Pad
if pad > 0:
input = np.pad(input, pad, mode='constant')
H, W = input.shape
outH = (H - kH) // stride + 1
outW = (W - kW) // stride + 1
output = np.zeros((outH, outW))
for i in range(outH):
for j in range(outW):
region = input[i*stride:i*stride+kH, j*stride:j*stride+kW]
output[i, j] = np.sum(region * kernel)
return output
# Edge detection kernel (Sobel)
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])
image = np.random.randn(10, 10)
edges = conv2d(image, sobel_x, pad=1)
print("Input shape:", image.shape, "→ Output shape:", edges.shape)Vision Transformers (ViT)
Image-কে patches-এ ভাগ:
P = patch size (usually 16), N = HW/P² = sequence length।
Each patch linearly embed + position embedding:
Standard Transformer encoder (CH 39) — global attention সব patch-এর মধ্যে।
Practice Tasks
- 3×3 conv, stride 2, padding 1 — 32×32 image-এর output size কত?
- ResNet skip connection ছাড়া gradient কীভাবে vanish করে? Math দিয়ে দেখান।
- Dilated conv d=2 হলে RF কত বাড়ে? (3×3 conv-এর জন্য)।
- ViT-এ 224×224 image, patch=16 → sequence length কত?
Interview Questions
- CNN-এ parameter sharing কেন দরকার? Translation equivariance কী?
- 3×3 conv দুইবার vs 5×5 conv একবার — params ও RF compare করুন।
- BatchNorm vs LayerNorm — vision-এ কোনটা কেন?
- ViT কেন large dataset (ImageNet-21k) ছাড়া CNN-এর চেয়ে খারাপ?
- Object detection-এ anchor box কী এবং কেন দরকার?
Summary · সারসংক্ষেপ
- CV = pixel tensor-এর উপর mathematical operations (conv, pool, attention)।
- Convolution = local weighted sum, learnable kernel দিয়ে feature extract করে।
- Receptive field = original image area যা output neuron দেখে, deeper = larger RF।
- ResNet skip = gradient flow preserve, dilated conv = larger RF without resolution loss।
- ViT = image patches + Transformer, CNN-এর inductive bias হারায় কিন্তু global context capture করে।