Computer Vision 3

Hurile Borjigin, 2024-10-07

Introduction and Object Detection

What this course is

High-level (semantic) computer vision:

From another perspective, this course sits at the intersection of computer vision, deep learning, and real-world applications, as shown in Fig 1.

Another perspective

What is computer vision:

Semantic scene understanding:

The scope of this course along two dimensions is shown in Fig 2.

CV III in two Dimensions

Understanding an image

Different representations depending on the granularity

Understanding a video

Why use the temporal domain

But challenges:

Some architectures and concepts

Object detection

One-stage detectors

One-stage detectors treat object detection as a single task, directly predicting the object class and bounding box coordinates from the input images.

Object detection with sliding window

For every position, measure the distance(or correlation) between the template and the image region(See Fig 3):

One-stage detector(old)
\begin{equation} L(x_0, y_0) = d(I_{(x_0, y_0)}, T) \label{eq:template-similarity} \end{equation}

where `d` is the distance metric, `I_{(x_0, y_0)}` is the image region at `(x_0, y_0)`, and `T` is the template.
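As a concrete illustration, here is a minimal NumPy sketch of sliding-window template matching with an SSD distance; the function names and the synthetic test image are illustrative, not part of any specific library.

```python
# Sliding-window template matching with a sum-of-squared-differences (SSD) distance.
import numpy as np

def ssd(region: np.ndarray, template: np.ndarray) -> float:
    """Sum of squared differences between an image region and the template."""
    return float(np.sum((region.astype(np.float32) - template.astype(np.float32)) ** 2))

def match_template(image: np.ndarray, template: np.ndarray):
    """Slide the template over the image; return the position with minimum distance."""
    th, tw = template.shape
    ih, iw = image.shape
    best_pos, best_score = None, np.inf
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            score = ssd(image[y:y + th, x:x + tw], template)
            if score < best_score:
                best_score, best_pos = score, (x, y)
    return best_pos, best_score

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((64, 64)).astype(np.float32)
    tpl = img[20:28, 30:38].copy()        # plant the template at (x, y) = (30, 20)
    print(match_template(img, tpl))       # expected: ((30, 20), 0.0)
```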
Template matching distances

Disadvantages

Feature-based detection

Idea: Learn feature-based classifiers invariant to natural object changes(See Fig 4)

One-stage detector(new)

Viola-Jones detector

Given data (xi,yi)`(x_i, y_i)`

  1. Define a set of Haar-like features(Fig 6)

    Haar-features are sensitive to directionality of patterns

    Haar features
  2. Find a weak classifier with the lowest error across the dataset

  3. Save the weak classifier and update the priority of the data samples

  4. Repeat Steps 2-3 N`N` times

The final classifier is a linear combination of all the weak learners.
Histogram of Oriented Gradients

Average gradient image over training samples `\rightarrow` gradients provide shape information.

HOG descriptor `\rightarrow` Histogram of oriented gradients.

Compute gradients on a dense grid and create a histogram based on gradient direction.

  1. Choose your training set of images that contain the object you want to detect.

  2. Choose a set of images that do NOT contain that object.

  3. Extract HOG features from both sets.

  4. Train an SVM classifier on the two sets to decide whether a feature vector represents the object of interest or not (0/1 classification); see the sketch below.
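A minimal sketch of this pipeline, assuming scikit-image's `hog` and scikit-learn's `LinearSVC` are available; data loading and the sliding-window test loop are left abstract.

```python
# HOG feature extraction + linear SVM training for a binary object/background classifier.
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def extract_hog(images):
    """Compute a HOG descriptor for each fixed-size grayscale image."""
    return np.stack([
        hog(img, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
        for img in images
    ])

def train_detector(pos_images, neg_images):
    """Train object (1) vs. background (0) on HOG features of the two image sets."""
    X = np.concatenate([extract_hog(pos_images), extract_hog(neg_images)])
    y = np.concatenate([np.ones(len(pos_images)), np.zeros(len(neg_images))])
    clf = LinearSVC(C=1.0)
    clf.fit(X, y)
    return clf

# At test time: slide a window over the image, extract HOG for each window,
# and score it with clf.decision_function.
```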

Deformable Part Model

Many objects are not rigid, so we could use a Bottom-up approach:

  1. detect body parts

  2. detect "person" if the body parts are in correct arrangement

Note: The amount of work for each RoI may grow significantly.

Two-stage detectors

Two-stage object detectors split the object detection task into two parts (Fig 7):

  1. Region Proposal: The network first identifies potential regions in the image that may contain objects.

  2. Refinement: These regions are further analyzed to classify the objects and refine their bounding boxes.

2-stage detection

A generic, class-agnostic objectness measure: object proposals or regions of interest(RoI)

Object proposal methods:

  1. Selective search

    - Using class-agnostic segmentation at multiple scales.

  2. Edge boxes

    - Bounding boxes that wholly enclose detected contours.

Non-Maximum Suppression (NMS)

Method to keep only the best proposals(See algorithm [alg:nms]).

Input: proposals sorted by score, IoU threshold `\lambda`
  `B_{\text{nms}} \gets \emptyset`
  for each proposal `b_i` (in decreasing score order):
      `\text{discard} \gets \text{False}`
      for each `b_j \in B_{\text{nms}}`: if `IoU(b_i, b_j) > \lambda` then `\text{discard} \gets \text{True}`
      if not discard: `B_{\text{nms}} \gets B_{\text{nms}} \cup b_i`
  return `B_{\text{nms}}`

Region overlap

We measure region overlap with the Intersection over Union(IoU) or Jaccard Index:

J(A, B) = \frac{|A \cap B|}{|A \cup B|}
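A minimal NumPy sketch of IoU and greedy NMS consistent with the pseudocode above; the box format `[x1, y1, x2, y2]` and the 0.5 threshold are illustrative choices.

```python
# IoU (Jaccard index) of axis-aligned boxes and greedy non-maximum suppression.
import numpy as np

def iou(a, b):
    """Intersection over Union of two boxes [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_thresh=0.5):
    """Keep the highest-scoring box, discard boxes overlapping it too much, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # the second box is suppressed by the first
```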

The threshold is the decision boundary used to determine whether a detected object or prediction should be considered valid. For example, in object detection:

Object detection with deep networks

Overfeat

Explores three well-known vision tasks using a single framework.

Training the convolutional features on all tasks boosts accuracy; see Fig 8.

Overfeat

Apply Non-Max Suppression to combine the predictions and windows, see Fig 9.

NMS

Improved detection accuracy is largely due to feature representation learned with deep nets.

But cons:

The complexity of a sliding window is O(ND)`O(ND)`, where N`N` is the number of all the windows, and D`D` represents the amount of work needed to perform detection and classification for one window. The complexity of the "pre-filtered" method is O(Nd+nD)`O(Nd + nD)`, where d`d` is the amount of work needed for filtering each window, and n`n` is the number of windows left after being filtered.

"Pre-filtering" method pays off when:

\begin{equation} O(ND) > O(Nd + nD) \end{equation}

Assume the constant factors are comparable/negligible:

\begin{equation} \frac{n}{N} + \frac{d}{D} < 1 \end{equation}

where `\frac{n}{N}` is the Region-of-Interest (RoI) ratio, and `\frac{d}{D}` is the efficiency of the RoI generator. In practice, there is a delicate balance between `n` and `d`: a cheaper filter (smaller `d`) typically leaves more windows (larger `n`).

Object proposals(Pre-filtering) and pooling for two-stage detection

Heuristic-based object proposal methods:

  1. Selective search

    - Using class-agnostic segmentation at multiple scales.

  2. Edge boxes

    - Bounding boxes that wholly enclose detected contours.

R-CNN family(Regions with CNN features)

R-CNN

Training scheme:

  1. Pre-train the CNN for image classification(ImageNet)

  2. Fine-tune the CNN on the set of classes the detector is meant to recognize

  3. Train a linear SVM classifier to classify image regions - one linear SVM per class

  4. Train the bounding box regressor

Pros:

Cons:

Problems:

  1. Input image has a fixed size

    FC layer prevents us from dealing with any image size.

  2. TBD

SPP-Net

Fast R-CNN

Fast R-CNN = R-CNN + RoI Pooling(Single layer SPP-Net)

Fast R-CNN

RoI Pooling RoI Pooling is a technique introduced in the Fast R-CNN framework to efficiently extract fixed-size feature representations for arbitrary regions (bounding boxes) in a feature map. It addresses the problem of feeding region proposals of various sizes into a neural network in a way that:

1. Feature extraction for the entire image

Instead of cropping individual region proposals from the original image and passing each of them through a ConvNet (as done in older approaches like R-CNN), we:

This means the network only processes the image once, which is much more efficient.

2. Mapping region proposals to the feature map

You have a set of region proposals(bounding boxes) in the original image coordinates-these come from an external region proposal method(e.g., selective search) or from a region proposal network(in Faster R-CNN).

Each region proposal is then mapped onto the feature map coordinates:

- Because the CNN reduces spatial resolution(due to pooling layers or strided convolutions), the coordinates of each region of the original image need to be scaled to align with the coordinate system of the smaller feature map.

3. Dividing each RoI into Sub-regions

To handle different RoI sizes but produce a fixed-size output (for example, a 7×7 output grid in Fast R-CNN):

4. Pooling operation

Within each of these sub-regions(bins):

By doing this for each sub-region in the grid, you transform the entire RoI into a fixed-size feature map (e.g., 7×7), regardless of the original size of the bounding box.
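A minimal NumPy sketch of steps 2-4 (mapping an RoI to the feature map, dividing it into bins, and max-pooling each bin); the stride of 16 and the 7×7 output size are illustrative assumptions.

```python
# RoI max pooling over a single (H, W, C) feature map; RoI is in image coordinates.
import numpy as np

def roi_pool(feature_map, roi, spatial_scale=1.0 / 16, output_size=7):
    """Map the RoI to feature-map coordinates, split it into bins, max-pool each bin."""
    H, W, C = feature_map.shape
    # scale the RoI from image coordinates to feature-map coordinates
    x1, y1, x2, y2 = [int(round(v * spatial_scale)) for v in roi]
    x1, y1 = max(0, x1), max(0, y1)
    x2, y2 = min(W, max(x2, x1 + 1)), min(H, max(y2, y1 + 1))
    out = np.zeros((output_size, output_size, C), dtype=feature_map.dtype)
    bin_h = (y2 - y1) / output_size
    bin_w = (x2 - x1) / output_size
    for i in range(output_size):
        for j in range(output_size):
            # integer bin boundaries (this quantization is what RoIAlign later removes)
            ys = y1 + int(np.floor(i * bin_h))
            ye = y1 + max(int(np.ceil((i + 1) * bin_h)), int(np.floor(i * bin_h)) + 1)
            xs = x1 + int(np.floor(j * bin_w))
            xe = x1 + max(int(np.ceil((j + 1) * bin_w)), int(np.floor(j * bin_w)) + 1)
            out[i, j] = feature_map[ys:ye, xs:xe].max(axis=(0, 1))  # max-pool the bin
    return out

fmap = np.random.rand(32, 32, 8).astype(np.float32)   # e.g. stride-16 features of a 512x512 image
print(roi_pool(fmap, roi=(100, 120, 300, 360)).shape)  # (7, 7, 8)
```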

5. Feeding pooled features to classifier layers

Now that you have a fixed spatial dimension (e.g., 7×7), you can:

Result of Fast R-CNN

Fast R-CNN results

Faster R-CNN

Faster R-CNN

Faster R-CNN = Fast R-CNN + Region proposal network

Region Proposal Network: We fix the number of proposals by using a set of n=9`n = 9` anchors per location.

\begin{equation} 9 \text{ anchors} = 3 \text{ scales} \times 3 \text{ aspect ratios} \end{equation}

A 256-d descriptor characterizes every location.

For each feature map location, we get a set of anchor corrections and a classification into object/non-object (see the sketch below).
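A minimal sketch of generating 9 anchors (3 scales × 3 aspect ratios) for a single location; the concrete scales and ratios below are illustrative, not necessarily the paper's exact values.

```python
# Generate 9 anchor boxes [x1, y1, x2, y2] centered at one feature-map location.
import numpy as np

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    anchors = []
    for s in scales:              # box area is roughly s * s
        for r in ratios:          # r = height / width
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)

print(make_anchors(8 * 16, 8 * 16).round(1))  # anchors for feature-map cell (8, 8), stride 16
```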

RPN: Training

Classification ground truth: We compute `p^*`, which indicates how much an anchor overlaps with the ground-truth bounding boxes:

\begin{equation} p^* = 1 \text{ if } IoU > 0.7 \end{equation} \begin{equation} p^* = 0 \text{ if } IoU < 0.3 \end{equation}

Faster R-CNN training:

- In the first implementation, the RPN was trained separately from the rest.

- Now we can train jointly!

- Four losses:

  1. RPN classification(object/non-object)

  2. RPN regression(anchor – proposal)

  3. Fast R-CNN classification(type of object)

  4. Fast R-CNN regression(proposal – box)

Faster R-CNN Performance

Feature Pyramid Networks

CNNs are not scale-invariant.

Straightforward implementation:

Integration with RPN for object detection:

Pros:

Cons: Increased model complexity.

Single-stage object detection

YOLO

YOLO

Inference time:

YOLO Inference

It is more efficient than Faster R-CNN, but less accurate.

SSD

Pros:

Cons:

Single shot detection

RetinaNet

Problems with one-stage detectors

Two-stage detectors:

Problems with one-stage detectors:

Class imbalance:

- Idea: balance the positives/negatives in the loss function.

- Recall cross-entropy loss:

CE loss

Focal loss

Replace CE with focal loss(FL):

Focal loss

RetinaNet

Retina

Spatial Transformers

Features of Spatial pooling:

Solution: Spatial Transformers.

Equivariance is needed:

\begin{equation} f(A(x)) = A(f(x)) \end{equation}

Cons:

- Difficulty of training for generalizing to more challenging scenes.

Detection evaluation

Precision: How many zebras you found are actually zebras?

\begin{equation} \text{Precision} = \frac{TP}{TP + FP} \end{equation}

Recall: How many actual zebras in the image/dataset you could find?

\begin{equation} \text{Recall} = \frac{TP}{TP + FN} \end{equation}

What is a true positive?

Resolving conflicts

All-in-one metric: Average precision(AP)

There is often a trade-off between Precision and Recall.

AP

AP = the area under the Precision-Recall curve(AUC)

Computing average precision

  1. Sort the predicted boxes by confidence score

  2. For each prediction find the associated ground truth

    - The ground truth should be unassigned and pass IoU threshold

  3. Compute cumulative TP and FP:

    - TP: [1, 0, 1] FP: [0, 1, 0] `\rightarrow` cTP = [1, 1, 2], cFP = [0, 1, 1]

  4. Compute precision and recall (#GT = 3 `\rightarrow` TP + FN = 3)

    - Precision: [1/1,1/2,2/3]`[1/1, 1/2, 2/3]`; Recall(#GT=3): [1/3,1/3,2/3]`[1/3, 1/3, 2/3]`

  5. Plot the precision-recall curve and compute the area beneath it (numerical integration); see the sketch below.
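A minimal NumPy sketch of the procedure above, assuming that the matching in step 2 has already produced a TP/FP flag for each prediction.

```python
# Average precision from prediction scores, TP/FP flags and the ground-truth count.
import numpy as np

def average_precision(scores, tp_flags, num_gt):
    """Sort by score, accumulate TP/FP, build the PR curve, integrate its area."""
    order = np.argsort(scores)[::-1]              # step 1: sort by confidence
    tp = np.asarray(tp_flags, dtype=float)[order]
    fp = 1.0 - tp
    ctp, cfp = np.cumsum(tp), np.cumsum(fp)       # step 3: cumulative TP / FP
    precision = ctp / (ctp + cfp)                  # step 4: precision and recall
    recall = ctp / num_gt
    return float(np.trapz(precision, recall))      # step 5: numerical integration

# The running example from the list: TP flags [1, 0, 1], #GT = 3
print(average_precision(scores=[0.9, 0.8, 0.7], tp_flags=[1, 0, 1], num_gt=3))
```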

mAP is the average over object categories

Single object tracking

Problem statement: Given observations at time `t`, find their correspondence at `t+n` (`n \neq 0`).

Problem state

Challenges:

A simple solution: Tracking by detection.

- Detect the object in every frame.

Problem: data association.

Bayesian tracking

Bayesian example

Goal: Estimate car position at each time instant (say, the white car).

Observation: Image sequence and known background.

Observations: Image

System state: Car position(x,y`x, y`)

Bayesian probabilities

Notations:

Goal:

Estimate posterior probability p(xkZk)`p(x_k|Z_k)`.

How? Recursion:

\begin{equation} p(x_{k-1}|Z_{k-1}) \rightarrow p(x_k|Z_k) \end{equation}

Hidden Markov model

Two assumptions of HMM:

  1. \begin{equation} p(z_k|x_k, Z_{k-1}) = p(z_k|x_k) \end{equation}
  2. \begin{equation} p(x_k|x_{k-1}, Z_{k-1}) = p(x_k|x_{k-1}) \end{equation}

Recursive Estimation

\begin{align*} p(x_k | \mathbf{Z}_k) &= p(x_k | z_k, \mathbf{Z}_{k-1}) \quad \text{(Applying Bayes' theorem)} \\ &\propto p(z_k | x_k, \mathbf{Z}_{k-1}) \cdot p(x_k | \mathbf{Z}_{k-1}) \\ &\propto p(z_k | x_k) \cdot p(x_k | \mathbf{Z}_{k-1}) \quad \text{(Assumption: $p(z_k | x_k, \mathbf{Z}_{k-1}) = p(z_k | x_k)$)} \\ &\propto p(z_k | x_k) \cdot \int p(x_k, x_{k-1} | \mathbf{Z}_{k-1}) \, dx_{k-1} \quad \text{(Marginalization)} \\ &\propto p(z_k | x_k) \cdot \int p(x_k | x_{k-1}, \mathbf{Z}_{k-1}) \cdot p(x_{k-1} | \mathbf{Z}_{k-1}) \, dx_{k-1} \\ &\propto p(z_k | x_k) \cdot \int p(x_k | x_{k-1}) \cdot p(x_{k-1} | \mathbf{Z}_{k-1}) \, dx_{k-1} \quad \text{(Assumption: $p(x_k | x_{k-1}, \mathbf{Z}_{k-1}) = p(x_k | x_{k-1})$)} \end{align*}

Key Concepts:

Bayesian formulation:

\begin{equation} p(x_k|Z_k) = k \cdot p(z_k|x_k) \cdot \int{p(x_k|x_{k-1}) \cdot p(x_{k-1}|Z_{k-1})dx_{k-1}} \end{equation}
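A minimal sketch of this recursion on a discretized 1-D state (a histogram filter): a predict step using `p(x_k|x_{k-1})` followed by an update step using `p(z_k|x_k)`; the transition and likelihood below are toy examples.

```python
# One step of recursive Bayesian filtering on a discretized state space.
import numpy as np

def predict(belief, transition):
    """p(x_k | Z_{k-1}) = sum over x_{k-1} of p(x_k | x_{k-1}) p(x_{k-1} | Z_{k-1})."""
    return transition @ belief

def update(belief_pred, likelihood):
    """p(x_k | Z_k) proportional to p(z_k | x_k) p(x_k | Z_{k-1}); normalize (constant k)."""
    posterior = likelihood * belief_pred
    return posterior / posterior.sum()

n = 10
belief = np.full(n, 1.0 / n)                         # uniform prior over 10 cells
# transition model: the state stays put or moves one cell to the right, each with prob 0.5
transition = 0.5 * np.eye(n) + 0.5 * np.roll(np.eye(n), 1, axis=0)
likelihood = np.full(n, 0.05)                        # observation strongly favors cell 4
likelihood[4] = 0.9

belief = update(predict(belief, transition), likelihood)
print(belief.round(3))
```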

Estimators:

Assume the posterior probability p(xkZk)`p(x_k|Z_k)` is known:

Deep networks:

Online vs. Offline tracking

Online tracking:

- "Given observations so far, estimate the current state."

Offline tracking:

- "Given all observations, estimate any state."

An online tracking model can be used for offline tracking too. Our recursive Bayesian model will still work.

GOTURN

GOTURN

Temporal prior:

\begin{equation} p(x_k|x_{k-1}) = \delta(x_k - x_{k-1}) \end{equation}

where `\delta` is the Dirac delta function.

Temporal prior

Advantages:

Disadvantages:

Online adaptation

Problem: The initial object appearance may change drastically,

- e.g., due to occlusion, pose change, etc.

Idea: adapt the appearance model on the (unlabeled) test sequence.

MDNet

Tracking with online adaptation

MDNet

Multi-object tracking

Challenges:

DL’s role in MOT:

  1. Tracking initialization(e.g., using a detector.)

    - Deep learning `\rightarrow` more accurate detectors.

  2. Prediction of the next position(motion model).

    - We can learn a motion model from data.

  3. Matching predictions with detections:

    - Learning robust metrics(robust to pose changes, partial occlusions, etc.).

    - Matching can be embedded into the model.

Approach 1

  1. Track initialization(e.g., using a detector).

  2. Prediction of the next position(motion model).

    - Classic: Kalman filter(e.g., state transition model.)

    - DL(data driven): LSTM/GRU.

  3. Matching predictions with detections(appearance model).

    - In general: reduction to the assignment problem.

    - Bipartite matching.

Typical models of dynamics

Bipartite matching

Bipartite matching
  1. Define distances between boxes(e.g., 1 - IoU).

    - We obtain N×N`N \times N` matrix.

  2. Solve the assignment.

    - Using the Hungarian algorithm: `O(N^3)`

  3. The bipartite matching solution corresponds to the minimum total cost (see the sketch below).
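A minimal sketch of this assignment step using a `1 - IoU` cost matrix and SciPy's Hungarian solver (`linear_sum_assignment`); the boxes below are toy examples.

```python
# Match track predictions to detections via minimum-cost bipartite matching.
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / (union + 1e-9)

predictions = np.array([[0, 0, 10, 10], [20, 20, 30, 30]], dtype=float)
detections  = np.array([[21, 19, 31, 31], [1, 0, 11, 10]], dtype=float)

cost = np.array([[1.0 - iou(p, d) for d in detections] for p in predictions])
rows, cols = linear_sum_assignment(cost)   # Hungarian method: minimum total cost
for r, c in zip(rows, cols):
    print(f"track {r} -> detection {c}, cost {cost[r, c]:.2f}")
```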

What happens if one detection is missing?

  1. Add a pseudo detection with a fixed cost(e.g., 0).

  2. Run the Hungarian method as before.

  3. Discard the assignment to the pseudo node.

What happens if no prediction is suitable?

Pseudo Node

Approach 2: Tracktor

  1. Detect objects in frame k1`k-1`(e.g., using an object detector)

  2. Initialize the tracks at the same locations in frame `k`.

  3. Refine predictions with regression.

  4. Run object detection again to find new tracks.

Tracktor

Advantages:

  1. We can reuse well-engineered object detectors:

    - The same architecture of regression and classification heads.

  2. Works well even if trained on still images:

    - The regression head is agnostic to the object ID and category.

Disadvantages:

  1. No motion model:

    - problems due to large motions(camera, objects) or low frame rate.

  2. Confusion in crowded spaces:

    - Since there is no notion of "identity" in the model.

  3. Temporary occlusions(the track is "killed"):

    - Generally applies to all online trackers.

    - Partial remedy: Long-term memory of tracks(using object embeddings).

Problem of using IoU as metric:

The implicit assumption of small motion.

We need a more robust metric.

Metric learning: For re-identification(Re-ID)

Metric spaces

Definition:

A set X`X`(e.g., containing images) is said to be a metric space if with any two points p`p` and q`q` of X`X` there is associated a real number d(p,q)`d(p, q)`, called the distance from p`p` to q`q`, such that

Any function with these properties is called a distance function, or a metric.

Let’s reformulate:

\begin{equation} d(p, q) = d_{\omega}(f_{\theta}(p), f_{\theta}(q)) \end{equation}

How do we train a network to learn such a feature representation?

Feature Representation

Metric learning method: add negative pairs.

Metric learning

- Minimize the distance between positive pairs; maximize it otherwise.

Our goal: d(A,B;θ)<d(A,C;θ)`d(A, B; \theta) < d(A, C; \theta)`.

The loss:

\begin{equation} \theta^* := \arg \min_{\theta}E_{A, B \in S^+}[d_{\theta}(A, B)] - E_{B, C \in S^-}[d_{\theta}(B, C)] \end{equation}

where S+`S^+` and S`S^-` are sets of positive and negative image pairs.

  1. Hinge loss:

    \begin{equation} L(A, B) = y^*||f(A) - f(B)||^2 + (1-y^*)\max{(0, m^2 - ||f(A) - f(B)||^2)} \end{equation}

    - where y`y^*` is 1 if (A, B) is a positive pair, and 0 otherwise.

    - hinge loss for negative pairs with margin m`m`

  2. Triplet loss:

    \begin{equation} L(A, B, C) = \max(0, ||f(A) - f(B)||^2 - ||f(A) - f(C)||^2 + m) \end{equation}
    Triplet loss
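A minimal NumPy sketch of the triplet loss above; the embedding dimension and margin are illustrative.

```python
# Triplet loss: pull the positive closer to the anchor than the negative, by a margin m.
import numpy as np

def triplet_loss(fa, fb, fc, m=0.2):
    """max(0, ||f(A) - f(B)||^2 - ||f(A) - f(C)||^2 + m)."""
    d_pos = np.sum((fa - fb) ** 2)
    d_neg = np.sum((fa - fc) ** 2)
    return max(0.0, d_pos - d_neg + m)

rng = np.random.default_rng(0)
anchor   = rng.normal(size=128)
positive = anchor + 0.05 * rng.normal(size=128)   # same identity, slightly perturbed
negative = rng.normal(size=128)                   # different identity
print(triplet_loss(anchor, positive, negative))   # ~0: this triplet is already well separated
```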

Metric learning for tracking

  1. Train an embedding network on triplets of data:

    - positive pairs: same person at different timesteps;

    - negative pairs: different people.

  2. Use the network to compute the similarity score for matching.

Summary of metric learning

Online vs. Offline tracking

Online tracking:

- "Given observations so far, estimate the current state."

Offline tracking:

- "Given all observations, estimate any state."

Graph based MOT

Tracking with network flow

Minimum-cost maximum-flow problem:

Determine the maximum flow with a minimum cost.

MOT

Minimizing the cost:

\begin{equation} f^* = \arg \min_f \sum_{i, j}{C(i, j)f(i, j)} \end{equation}

where `f` is the set of disjoint trajectories (the flow), `C(i, j)` is the cost, and `f(i, j) \in \{0, 1\}` is the flow indicator.

To incorporate detection confidence, we split the node in two.

Network flow

where Cdet`C_{det}` indicates the detection confidence, and Ct`C_t` indicates the transition cost.

Problem with occlusions, such as:

Solution: Connect all nodes(detections) to entrance/exit nodes:

Solution to occlusion

And the graph is subject to flow conservation:

Flow conservation

MAP formulation

- Our solution is a set of trajectories `\mathcal{T}^*`

\begin{equation} \mathcal{T}^* = \arg\max_{\mathcal{T}} P(\mathcal{T} \mid \mathcal{X}) \end{equation} \begin{equation} = \arg\max_{\mathcal{T}} P(\mathcal{X} \mid \mathcal{T}) P(\mathcal{T}) \end{equation} \begin{equation} = \arg\max_{\mathcal{T}} \prod_i P(x_i \mid \mathcal{T}) P(\mathcal{T}) \end{equation} \begin{equation} = \arg\max_{\mathcal{T}} \prod_i P(x_i \mid \mathcal{T}) \prod_{T_i \in \mathcal{T}} P(T_i) \end{equation}

Bayes rule

Conditional independence of observations

Independence of trajectories

MAP formulation:

\begin{equation} \arg \max_{\mathcal{T}}\prod_{i}{P(x_i|\mathcal{T})}\prod_{T_i \in \mathcal{T}}{P(T_i)} \end{equation} \begin{equation} \arg \min_{\mathcal{T}} - \sum_i\log{P(x_i|\mathcal{T})} - \sum_{T_i \in \mathcal{T}}\log{P(T_i)} \rightarrow \text{ log-space for optimization} \end{equation}

Prior

\begin{equation} \sum_{T_i \in \mathcal{T}} \log P(T_i) \end{equation}

Trajectory model: count the entrance, exit and transition costs

\begin{equation} T_i := (\mathbf{x}_0, \mathbf{x}_1, ..., \mathbf{x}_n) \end{equation} \begin{equation} P(T_i) = P_{\text{in}}(\mathbf{x}_0) \prod_{j=1,n} P_t(\mathbf{x}_j \mid \mathbf{x}_{j-1}) P_{\text{out}}(\mathbf{x}_n) \end{equation} \begin{equation} -\log P(T_i) = -\log P_{\text{in}}(\mathbf{x}_0) - \sum_{j=1,n} \log P_t(\mathbf{x}_j \mid \mathbf{x}_{j-1}) - \log P_{\text{out}}(\mathbf{x}_n) \end{equation}

Entrance cost: `-\log P_{\text{in}}(\mathbf{x}_0) = f_{\text{in}}(\mathbf{x}_0)C_{\text{in}}(\mathbf{x}_0)`

Transition cost: `-\log P_t(\mathbf{x}_j \mid \mathbf{x}_{j-1}) = f_t(\mathbf{x}_j, \mathbf{x}_{j-1})C_t(\mathbf{x}_j, \mathbf{x}_{j-1})`

Exit cost: `-\log P_{\text{out}}(\mathbf{x}_n) = f_{\text{out}}(\mathbf{x}_n)C_{\text{out}}(\mathbf{x}_n)`

Likelihood:

\begin{equation} - \sum_i \log P(\mathbf{x}_i \mid \mathcal{T}) \end{equation}

We can use Bernoulli distribution:

\begin{equation} P(\mathbf{x}_i \mid \mathcal{T}) := \begin{cases} \gamma_i, & \text{if } \exists T_j \in \mathcal{T}, \mathbf{x}_i \in T_j \\ 1 - \gamma_i, & \text{otherwise} \end{cases} \end{equation}

γi`\gamma_i` denotes prediction confidence (e.g., provided by the detector)

\begin{equation} -\log P(\mathbf{x}_i \mid \mathcal{T}) = - f(\mathbf{x}_i) \log \gamma_i - (1 - f(\mathbf{x}_i)) \log (1 - \gamma_i) \end{equation} \begin{equation} = f(\mathbf{x}_i) \log \frac{1 - \gamma_i}{\gamma_i} - \log (1 - \gamma_i) \end{equation}

`C_{\text{det}}(\mathbf{x}_i) = \log \frac{1 - \gamma_i}{\gamma_i}`

The term `- \log (1 - \gamma_i)` does not depend on the flow and can be ignored in the optimization.

Optimization

(Cdet,Cin,Cout,Ct`C_{\text{det}}, C_{\text{in}}, C_{\text{out}}, C_t`) are estimated from data. Then:

Summary

Open questions:

Tracking with Message Passing Network

End-to-end learning?

Setup

Deep learning on graphs

Key challenges:

Message Passing Network

Message passing

  1. Initial graph

    • Graph: G=(V,E)`G = (V, E)`

    • Initial embeddings:

      - Node embedding: `h_i^{(0)}, i \in V`

      - Edge embedding: `h_{(i, j)}^{(0)}, (i, j) \in E`

    • Embeddings after `l` steps: `h_i^{(l)}, i \in V` and `h_{(i, j)}^{(l)}, (i, j) \in E`

  2. "Node-to-edge" update

    Node-to-edge update
  3. "Edge-to-node" update

    1. Use the updated edge embeddings to update nodes:

      image

    2. After a round of edge updates, each edge embedding contains information about its pair of incident nodes.

    3. By analogy: `h_i^{(l)} = \mathcal{N}_v([h_i^{(l-1)}, h_{(i, j)}^{(l)}])`

    4. In general, we may have an arbitrary number of neighbors (the node's "degree", or "valency").

    5. Define a permutation-invariant aggregation function:

      \begin{equation} \Phi^{(l)}(i) := \Phi(\{h_{(i, j)}^{(l)}\}_{j\in \mathcal{N}_e(i)}) \end{equation}

      where the input is a set of embeddings from incident edges.

    6. Re-define the edge-to-node updates for a general graph:

      image

      \begin{equation} h_i^{(l)} = \mathcal{N}_v(h_i^{(l-1)}, \Phi^{(l)}(i)) \end{equation}
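A minimal NumPy sketch of one message-passing step (node-to-edge followed by edge-to-node), using single-layer ReLU maps as stand-ins for the learned networks `\mathcal{N}_e`, `\mathcal{N}_v` and a sum as the permutation-invariant `\Phi`.

```python
# One round of node-to-edge and edge-to-node updates on a tiny graph.
import numpy as np

rng = np.random.default_rng(0)
def mlp(dim_in, dim_out):
    W = rng.normal(scale=0.1, size=(dim_in, dim_out))
    return lambda x: np.maximum(0.0, x @ W)       # a single ReLU layer as a stand-in

d = 16
nodes = {i: rng.normal(size=d) for i in range(3)}                  # h_i^(0)
edges = {(0, 1): rng.normal(size=d), (1, 2): rng.normal(size=d)}   # h_(i,j)^(0)
N_e = mlp(3 * d, d)   # node-to-edge update network
N_v = mlp(2 * d, d)   # edge-to-node update network

# "Node-to-edge": each edge looks at its two incident nodes and its own embedding.
edges = {(i, j): N_e(np.concatenate([nodes[i], nodes[j], h]))
         for (i, j), h in edges.items()}

# "Edge-to-node": each node aggregates its incident edges (permutation-invariant sum).
new_nodes = {}
for i in nodes:
    incident = [h for (a, b), h in edges.items() if i in (a, b)]
    agg = np.sum(incident, axis=0) if incident else np.zeros(d)    # Phi(i)
    new_nodes[i] = N_v(np.concatenate([nodes[i], agg]))
nodes = new_nodes
print(nodes[1].shape)  # (16,)
```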

Remarks:

MOT with MPN

MOT with MPN
  1. Input

  2. Graph construction + feature encoding

    - Encode appearance and scene geometry cues into node and edge embeddings.

  3. Neural message passing

    - Propagate cues across the entire graph with neural message passing.

  4. Edge classification

    - Learn to directly predict solutions of the Min-cost flow problem by classifying edge embeddings.

  5. Output

Feature encoding

Appearance and geometry encodings:

Geometry encodings
  1. Relative Box Position

    \begin{equation} \left( \frac{2(y_j - y_i)}{h_i + h_j}, \frac{2(x_j - x_i)}{w_i + w_j} \right) \end{equation}
    • This encodes the normalized relative position of two bounding boxes.

    • The difference in the y-coordinate is divided by the sum of their heights.

    • The difference in the x-coordinate is divided by the sum of their widths.

    • The factor 2 ensures the values are in a normalized range.

  2. Relative Box Size

    \begin{equation} \left( \log \frac{h_i}{h_j}, \log \frac{w_i}{w_j} \right) \end{equation}
    • These terms encode the relative scale change between the two bounding boxes.

    • Using a logarithm makes it more robust to size differences.

  3. Time Difference

    \begin{equation} t_j - t_i \end{equation}
    • This represents the time gap between the two detections.

    • It helps in determining whether two bounding boxes belong to the same object across frames (see the sketch below).
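A minimal sketch of computing these geometry features for one edge, assuming each detection is given as a box center `(x, y)`, size `(w, h)` and frame index `t`.

```python
# Geometry encoding for the edge between two detections i and j.
import numpy as np

def edge_geometry(det_i, det_j):
    xi, yi, wi, hi, ti = det_i
    xj, yj, wj, hj, tj = det_j
    return np.array([
        2 * (yj - yi) / (hi + hj),   # relative vertical position
        2 * (xj - xi) / (wi + wj),   # relative horizontal position
        np.log(hi / hj),             # relative height (log scale change)
        np.log(wi / wj),             # relative width (log scale change)
        tj - ti,                     # time difference between the two frames
    ])

print(edge_geometry((100, 200, 40, 80, 3), (110, 205, 42, 84, 5)).round(3))
```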

*Shared weights of CNN and MLP for all nodes and edges

Contrast:

Temporal causality

Flow conservation at a node

Time-aware message passing:

Time-aware message passing

Classifying edges

Obtaining final solutions

Summary

Evaluation of MOT

Compute a set of measures per frame:

Metrics:

Segmentation

Segmentation

Flavors of image segmentation

K-means(clustering)

  1. Initialize(randomly) K cluster centers

  2. Assign points to clusters

    - using a distance metric

  3. Update centers by averaging the points in the cluster

  4. Repeat steps 2 and 3 until convergence (see the sketch below).
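A minimal NumPy sketch of these four steps with the Euclidean distance; initialization and the convergence check are kept deliberately simple.

```python
# K-means clustering: initialize, assign, update, repeat until convergence.
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]        # 1. initialize
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)                                   # 2. assign
        new_centers = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])                                                              # 3. update
        if np.allclose(new_centers, centers):                           # 4. converged?
            break
        centers = new_centers
    return labels, centers

pts = np.concatenate([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centers = kmeans(pts, k=2)
print(centers.round(2))   # roughly (0, 0) and (5, 5)
```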

Problem: The Euclidean distance may be insufficient. Instead, we want to cluster based on the notion of "connectedness".

Spectral clustering

Construct a Similarity Graph:

Compute the Graph Laplacian:

Optimization problem:

The Laplacian quadratic form measures how much variation exists between connected nodes.

\begin{equation} x^TLx = \frac{1}{2}\sum_{i, j}A_{i,j}(x_i - x_j)^2 \end{equation}

where

Intuitively, if

\begin{equation} x^TLx = \frac{1}{2}\sum_{i, j}A_{i,j}(x_i - x_j)^2 \approx 0 \end{equation}

then,

The solution `x^* = \arg \min_x x^TLx` (subject to `x \perp \mathbf{1}` and `\|x\| = 1`) is the eigenvector corresponding to the second-smallest eigenvalue of `L`.
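A minimal NumPy sketch of this spectral step: build a Gaussian affinity matrix, form `L = D - A`, and split the points by the sign of the eigenvector with the second-smallest eigenvalue; the Gaussian bandwidth `sigma` is an illustrative choice.

```python
# Spectral bipartition via the second-smallest eigenvector of the graph Laplacian.
import numpy as np

def spectral_bipartition(points, sigma=1.0):
    diff = points[:, None] - points[None]
    A = np.exp(-np.sum(diff ** 2, axis=2) / (2 * sigma ** 2))   # similarity graph
    np.fill_diagonal(A, 0.0)
    L = np.diag(A.sum(axis=1)) - A                               # graph Laplacian L = D - A
    eigvals, eigvecs = np.linalg.eigh(L)                         # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                                      # second-smallest eigenvalue
    return (fiedler > 0).astype(int)

pts = np.concatenate([np.random.randn(30, 2), np.random.randn(30, 2) + 8])
print(spectral_bipartition(pts))   # two well-separated blobs -> two cluster labels
```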

Special case:

  1. Zero eigenvalue

    \begin{equation} x^TLx = \frac{1}{2}\sum_{i, j}A_{i,j}(x_i - x_j)^2 = 0, \quad A_{ij} \geq 0 \text{ for all } i, j \end{equation}

    What can we say about the corresponding eigenvector?

  2. Connected components

    - Proposition: the multiplicity `k` of the eigenvalue 0 equals the number of connected components; the corresponding eigenspace is spanned by the components' indicator vectors.

    - Example with two components (`n \times 2`):

    Example
Spectral clustering in practice

Normalized cut(Ncut)

Spectral clustering can be interpreted as a minimum cut, as shown in Fig 54.

Spectral clustering as a min cut

Problem with min cuts: they favour unbalanced, isolated clusters:

Poor min cut

Balanced cut

\begin{equation} Ncut(A, B) = cut(A, B)\left(assoc^{-1}(A, V) + assoc^{-1}(B, V)\right) \end{equation}

where

Ncut

  1. Define a graph G:=(V,E)`G:= (V, E)`

    • V is the set of nodes representing pixels.

    • E defines similarities of two nodes.

  2. Solve a generalized eigenvalue problem: `(D - W)x = \lambda Dx`

    where

    • `d(i) = \sum_{j}w_{i,j}`

    • `(D-W)` is the Laplacian matrix

    • equivalently `D^{-\frac{1}{2}}(D-W)D^{-\frac{1}{2}}x = \lambda x`

  3. Use the eigenvector with the 2nd smallest eigenvalue to cut the graph.

  4. Recurse if needed(i.e. subdivide the two node groups).

Energy-based model: Conditional random fields(CRFs)

Energy function:

\begin{equation} E(x, y) = \sum_i\phi(x_i, y_i) + \sum_{i, j}\psi(x_i, x_j) \end{equation}

where

Conditional Random Fields

Boykov and Jolly (2001)

E(x,y) = \sum_i \varphi(x_i, y_i) + \sum_{ij} \psi(x_i, x_j)

Variables

Unary term

Pairwise term

Optimization with graph cuts:

Max-flow min-cut theorem: The maximum value of an s-t flow is equal to the minimum capacity over all s-t cuts.

Grid structured random fields

Fully connected models:

Using fully connected CRFs:

Fully connected CRFs

Fully convolutional neural networks

Deep networks for classification:

Classification networks

To extend this to segmentation:

Extended networks for segmentation
  1. Remove GAP:

    • Increase number of parameters

    • Fixed input size only

    • No translation invariance

  2. Replace fully connected layer with convolution(1×1`1 \times 1` convolution)

    • Few parameters

    • Variable input size

    • Translation invariance

  3. Upsample the last layer output to the original resolution.

  4. Can be trained with pixel-wise cross-entropy with SGD.

1×1`1\times 1` convolution

\(1\times1\) conv

We may try to maintain the original image resolution in the encoder:

Output stride: input resolution/output resolution.

Decreasing the output stride improves segmentation accuracy.

The problems of keeping feature resolution high in all layers:

Dilated convolutions: To maintain the receptive field size.

Problem 2: Computational footprint(both memory and FLOPs):

Solution: Upsampling

Recall convolution(as matrix multiplication):

Conv as matrix multiplication

We can obtain the opposite effect(increase the output size) by:

\begin{equation} X = W^TX' \rightarrow \text{ broadcasting} \end{equation}

Transposed convolution:

Upsampling: Interpolation

Resize-convolution:

U-Net

To mitigate information loss, we pass features from the early encoder layers to the decoder via skip connections.

U-Net

U-Net’s key unique features:

SegNet

SegNet

SegNet is a deep learning architecture primarily used for image segmentation tasks. It is similar to U-Net in that it uses an encoder-decoder structure, but it has some unique features that distinguish it.

Best practices for segmentation

Instance segmentation

  1. Label every pixel, including the background(sky, grass, road).

  2. Does not differentiate between the pixels from objects(instances) of the same class.

  3. Do not label pixels from uncountable objects("stuffs"), e.g., "sky", "grass", "road".

  4. Differentiates between the pixels coming from instances of the same class.

Instance segmentation methods

Proposal based method: Mask R-CNN

Start with Faster R-CNN

Mask R-CNN

RoIPool vs. RoIAlign

RoIAlign:

RoIAlign

Problem of Mask R-CNN: Low mask resolution.

Mask R-CNN with PointRend

Remarks:

Proposal-free method

We obtain a semantic map using fully convolutional networks for semantic segmentation.

SOLOv2

Idea: Predict the kernels:

SOLOv2

Instance segmentation: Summary

Panoptic segmentation

Typical architecture

Challenges:

Two broad categories:

Panoptic FPN

Panoptic FPN

Remark:

Training with multiple loss terms("multi-task learning") can be challenging, as different loss terms may "compete" for desired feature representation.

Panoptic FPN: Summary

Panoptic FCN

Even simpler?

Panoptic FCN
Panoptic FCN

Panoptic evaluation

Video object segmentation

Goal: Generate accurate and temporally consistent pixel masks for objects in a video sequence.

Challenges:

What we need:

  1. Appearance model:

    • Assumption: constant appearance.

    • Input: 1 frame.

    • Output: segmentation mask.

  2. Motion model(maybe optional):

    • Assumption: smooth displacement; brightness constancy.

    • Input: 2 frames.

    • Output: motion(optical flow).

Semi-supervised(one-shot) VOS task formulation:

Motion-based VOS

Optical flow:

Minimize brightness difference:

Brightness difference
\begin{equation} E_{SSD}(u, v) = \sum_{(x, y) \in R}(I(x+u, y+v, t+1) - I(x, y, t))^2 \end{equation}

Lucas-Kanade:

[uv]=[RIx2RIxIyRIxIyRIy2]1[RIxItRIyIt]\begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} \sum_R I_x^2 & \sum_R I_x I_y \\ \sum_R I_x I_y & \sum_R I_y^2 \end{bmatrix}^{-1} \begin{bmatrix} -\sum_R I_x I_t \\ -\sum_R I_y I_t \end{bmatrix}

This is a classical optical-flow technique: the Lucas-Kanade method.
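A minimal NumPy sketch of this solution for a single patch `R`; gradients are taken with finite differences, and the synthetic one-pixel shift is only a sanity check.

```python
# Lucas-Kanade flow for one patch: build the structure matrix and solve the 2x2 system.
import numpy as np

def lucas_kanade_patch(I0, I1):
    """Estimate one (u, v) for the whole patch; I0, I1 are same-size grayscale patches."""
    Iy, Ix = np.gradient(I0.astype(np.float64))           # spatial gradients
    It = I1.astype(np.float64) - I0.astype(np.float64)    # temporal difference
    A = np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
                  [np.sum(Ix * Iy), np.sum(Iy * Iy)]])
    b = -np.array([np.sum(Ix * It), np.sum(Iy * It)])
    return np.linalg.solve(A, b)                          # (u, v)

# Synthetic check: shift a smooth patch by one pixel to the right.
x = np.linspace(0, 4 * np.pi, 64)
frame0 = np.sin(x)[None, :] * np.cos(x)[:, None]
frame1 = np.roll(frame0, 1, axis=1)
print(lucas_kanade_patch(frame0, frame1).round(2))        # approximately (1, 0)
```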

Joint formulation of segmentation and optical flow estimation:

VOS with optical flow

FlowNet: Architecture 1

FlowNet: Architecture 2 `\rightarrow` Siamese architecture

Siamese architecture

Correlation layer:

Motion-based VOS: Summary

Appearance-only VOS

Main idea:

One-shot VOS(OSVOS): Separate the training steps

OSVOS

Drifting problem:

OnAVOS: Online Adaptation

OnAVOS

MaskTrack

MaskTrack

Training inputs can be simulated.

Appearance-only VOS: Summary

Advantages of appearance-based models:

Disadvantages:

Metric-based approaches

Pixel-wise retrieval:

Transformers

CNNs: A few relevant facts

Universality: Developing components that work across all possible settings:

Self-attention: A hash table analogy

Consider a hash table:

Putting it all together:

\begin{equation} \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{n}}\right)V \end{equation}

The scaling factor `\sqrt{n}` eases optimization, especially for large `n`.
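A minimal NumPy sketch of this formula; `n` is taken to be the query/key dimension, and the shapes are illustrative.

```python
# Scaled dot-product attention: softmax(Q K^T / sqrt(n)) V.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    n = Q.shape[-1]                              # query/key dimension
    weights = softmax(Q @ K.T / np.sqrt(n))      # (num_queries, num_keys), rows sum to 1
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))    # 4 queries
K = rng.normal(size=(6, 8))    # 6 keys
V = rng.normal(size=(6, 16))   # 6 values
print(attention(Q, K, V).shape)  # (4, 16)
```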

Linear projection

Complexity

Multi-head attention

Normalization:

Further improvements:

Recap:

Transformers

Transformer-based encoder:

Encoder

Vision Transformers

An all-round transformer architecture:

Steps:

  1. Split an image into fixed-sized patches.

  2. Linearly embed each patch(a fully connected layer).

  3. Attach an extra "[class]" embedding(learnable).

  4. Add positional embeddings.

  5. Feed the sequence to the standard Transformer encoder.

  6. Predict the class with an MLP on the output [class] token; steps 1-4 are sketched below.
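A minimal NumPy sketch of steps 1-4, with random weights standing in for the learned patch embedding, [class] token and positional embeddings.

```python
# ViT tokenization: patchify, linearly embed, prepend [class] token, add positions.
import numpy as np

def vit_tokens(image, patch=16, dim=192, rng=np.random.default_rng(0)):
    H, W, C = image.shape
    # 1. split the image into fixed-size patches and flatten each patch
    patches = image.reshape(H // patch, patch, W // patch, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    # 2. linear embedding (a fully connected layer)
    W_embed = rng.normal(scale=0.02, size=(patch * patch * C, dim))
    tokens = patches @ W_embed
    # 3. prepend the learnable [class] token
    cls = rng.normal(scale=0.02, size=(1, dim))
    tokens = np.concatenate([cls, tokens], axis=0)
    # 4. add positional embeddings; the result is fed to the Transformer encoder (step 5)
    pos = rng.normal(scale=0.02, size=tokens.shape)
    return tokens + pos

img = np.random.rand(224, 224, 3).astype(np.float32)
print(vit_tokens(img).shape)   # (1 + 14*14, 192) = (197, 192)
```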


Vision Transformer

Experiment with ViT

Swin Transformers

Recall ViT

Swin Transformer:

A naive implementation will hurt context reasoning:

At any given level(except the last one), our context is not global anymore:

Naive Swin

Solution: Alternate the layout of local windows:

Mature Swin

Successively decrease the number of tokens using patch merging:

Swin Transformer

Results:

Summary:

Transformer for detection?

DETR

DETR

A closer look:

DETR

The loss function

\begin{equation} \mathcal{L}_{Hungarian}(y, \hat{y}) = \sum^N_{i=1}\left[-\log{\hat{p}_{\hat{\sigma}(i)}(c_i)} + \mathbf{1}_{\{c_i \neq \emptyset\}}\mathcal{L}_{box}(b_i, \hat{b}_{\hat{\sigma}(i)})\right] \end{equation}

where

The bounding box loss:

\begin{equation} \lambda_{iou}\mathcal{L}_{iou}(b_i, \hat{b}_{\sigma(i)}) + \lambda_{L1}||b_i - \hat{b}_{\sigma(i)}||_1 \end{equation}

DETR: Summary

MaskFormer

MaskFormer

A unified model for semantic and panoptic segmentation.

Recall: Panoptic FCN

Panoptic FCN

MaskFormer’s idea: Compute the kernels using learnable queries with a Transformer.

Works well, but has troubles with small objects/segments.

Mask2Former

Idea: Attend with self-attention at multiple scales of the feature hierarchy:

Mask2Former

Masked attention:

Mask2Former: Summary

Conclusions

Unsupervised

Learning without labels

Compact, yet descriptive representation?

Consider the loss:

\begin{equation} \mathcal{L} = I(X; Z) - \beta I(Z;Y) \end{equation}
Unsupervised loss

where

Recap: Mutual Information(MI)

\begin{equation} I(X;Y) = \sum_{x\in \mathcal{X}}\sum_{y \in \mathcal{Y}}P_{(X, Y)}(x, y) \log \frac{P_{(X, Y)}(x, y)}{P_X(x)P_Y(y)} \end{equation}

After obtaining compact and descriptive representation, we can apply it for different tasks, such as detection, classification, segmentation, etc. by adding task-specific shallow models(e.g., linear projection).

Evaluating Self-supervised learning(SSL) models

Pretext tasks

Idea: Solve a different task with available(generated) supervision.

Caveats:

Pretext task: Rotation

Task: predicting image rotation.

Training process outline

  1. Quantize rotation angles(e.g., 0, 90, 180, 270 - 4 classes).

  2. Sample an image from the training dataset.

  3. Rotate the image by one of the pre-defined angles.

    - This defines our "ground-truth" label.

  4. Train the network to predict the correct rotation class (see the sketch below).
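A minimal sketch of steps 1-3: each training image is rotated by a random multiple of 90 degrees, and the rotation index becomes the free "ground-truth" label.

```python
# Generate (rotated image, rotation class) pairs for the rotation pretext task.
import numpy as np

ROTATION_CLASSES = [0, 90, 180, 270]   # step 1: quantized rotation angles (4 classes)

def make_rotation_sample(image, rng=np.random.default_rng()):
    """Return (rotated_image, label) for one (H, W, C) image."""
    label = int(rng.integers(len(ROTATION_CLASSES)))       # step 2/3: pick a rotation class
    rotated = np.rot90(image, k=label, axes=(0, 1))        # rotate by label * 90 degrees
    return rotated, label

img = np.random.rand(32, 32, 3)
x, y = make_rotation_sample(img)
print(x.shape, y)   # the network is then trained to predict y from x (step 4)
```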

Pretext task: Rotation

Pretext task: Jigsaw puzzle

Jigsaw puzzle

Pretext task: Colorization

Predicting the original color of the images(in CIELAB color space):

Intuition: Proper colorization requires semantic understanding of the image.

Colorization

Nuances:

Contrastive learning

Recall metric learning

\begin{equation} \mathcal{L}(A, B, C) = \max{(0, ||f(A) - f(B)||^2 - ||f(A) - f(C)||^2 + m)} \end{equation}

Intuitive idea:

Metric learning

We need labels to define positive and negative pairs.

Contrastive learning is an extension of metric learning to unsupervised scenarios.

Idea:

Example:

Intuition

Deep frameworks for SSL

Non-contrastive learning

DINO

DINO

DINOv2

DINOv2

Masked Autoencoders

Unsupervised learning with Transformers and a reconstruction loss:

Masked autoencoders

Reconstructing HoG features, instead of pixel values:

Reconstructing HoG features

Remarks:

Downstream applications

What do DINO features encode?

DINO(and other SSL methods) provides semantic correspondence(almost) out-of-the-box.

Extracted features
Extracted features

We can also cluster "background" (stuff) areas, hence obtain semantic segmentation.

High level idea: Learn a lower dimensional embedding, S(f(i),f(j))`S(f(i), f(j))`, such that clustering in this space yields semantic masks.

Extracted features

Self-supervision in videos

Two groups of problems:

Segmentation from motion cues. Idea: if the object mask is correct, we cannot reconstruct the object-related optical flow. We train two networks.

Contrastive random walk

Forward-backward cycle consistency:

Conclusion

Semi-supervised learning

Training on labeled and unlabeled data.

Semi-supervised

General remarks:

Small print:

  1. Improvement is not always guaranteed.

    - It depends on the model, the technique used and the unlabeled data.

  2. Semi-supervised techniques are often complementary.

    - A combination of multiple techniques yields the best results (though it makes the framework more complex).

A practical perspective:

Semi-loss

Assumptions

Assumptions about semi-supervised learning:

Smoothness assumption

If two input points are close by, their labels should be the same.

Transitivity

Low-density assumption

The decision boundary should pass through a region with low density p(x)`p(x)`.

Manifold assumption

Remark:

Which assumptions to make depends on what we know about how our data distribution p(x)`p(x)` interacts with the class posterior p(yx)`p(y|x)`.

Two taxonomies

How unlabeled data is used:

Unsupervised pre-processing

- pre-training, clustering, etc.;

- Two stages:

Wrapper methods

Self-training

A single classifier trained jointly on labeled and self-labeled data from the unlabeled dataset.

OnAVOS:

OnAVOS

Online adaptation: Adapt model to appearance changes in every frame, not just the first frame.

Drawback: Can be slow.

Online Adaptation
OnAVOS with unlabeled dataset

Segment Anything (2)

Self training with pseudo labels

A general outline:

  1. Train a strong baseline on the labeled set;

    - e.g. with heavy data augmentation(crops, photometric noise).

  2. Predict pseudo-labels for the unlabeled set.

  3. Continue training the network on both labeled and pseudo-labeled samples.

  4. Repeat steps 2-3 (a minimal sketch follows below).
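A minimal sketch of this loop, assuming a scikit-learn-style estimator with `fit`/`predict_proba`; the confidence threshold and the number of rounds are illustrative choices.

```python
# Self-training with pseudo-labels: train, pseudo-label confident samples, retrain, repeat.
import numpy as np

def self_training(model, labeled_x, labeled_y, unlabeled_x, rounds=3, conf_thresh=0.9):
    model.fit(labeled_x, labeled_y)                      # 1. strong baseline on the labeled set
    for _ in range(rounds):                              # 4. repeat steps 2-3
        probs = model.predict_proba(unlabeled_x)         # 2. predict pseudo-labels
        conf, pseudo_y = probs.max(axis=1), probs.argmax(axis=1)
        keep = conf > conf_thresh                        # keep only confident pseudo-labels
        x = np.concatenate([labeled_x, unlabeled_x[keep]])
        y = np.concatenate([labeled_y, pseudo_y[keep]])
        model.fit(x, y)                                  # 3. retrain on labeled + pseudo-labeled
    return model
```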

Intrinsically semi-supervised

Entropy minimization

Virtual adversarial network

Idea: Perturbations of the input should not change the label.

Learning from synthetic data

Labeled and unlabeled data may come from different distributions.

e.g. due to differences in the synthetic and real appearances.

Domain alignment

This translates into disjoint feature distribution of a model trained only on the labeled data:

Domain alignment

Solution: "Align" the two distributions.

Domain alignment means making two feature distributions indistinguishable.

We can use a GAN:

GAN

Consistency regularization

Consistent prediction across image transformations.

Semantic meaning does not change, though not guaranteed in a deep net `\rightarrow` Use it as a consistency loss.

Framework:

Araslanov and Roth

Momentum net

Test-time augmentation is applied online at training time:

Momentum net

Limited supervision

Limited supervision

Weakly supervised learning: training with coarse labels:

Segmentation with image-level labels

Idea: we can reuse the classifier weights to classify every pixel.

Summary

Computer Vision 3

Introduction and Object Detection

What this course is

Hight-level(semantic) computer vision:

From another perspective this is the intersection of Computer vision, deep learning and real-world applications, as shown in Fig 1.

Another perspective

What is computer vision:

Semantic scene understanding:

This course in two dimension is as shown in Fig 2.

CV III in two Dimensions

Understanding an image

Different representations depending on the granularity

Understanding a video

Why use the temporal domain

But challenges:

Some architectures and concepts

Object detection

One-stage detectors

One-stage detectors treat object detection as a single task, directly predicting the object class and bounding box coordinates from the input images.

Object detection with sliding window

For every position, measure the distance(or correlation) between the template and the image region(See Fig 3):

One-stage detector(old)
L(x0,y0)=d(I(x0,y0),T)\labeltemplatesimilarity\begin{equation} L(x_0, y_0) = d(I_{(x_0, y_0)}, T) \label{template similarity} \end{equation}

where d`d` is the distance metric, I(x0,y0)`I_{(x_0, y_0)}` is the image region, and T`T` is the template.
Template matching distances

Disadvantages

Feature-based detection

Idea: Learn feature-based classifiers invariant to natural object changes(See Fig 4)

One-stage detector(new)

Viola-Jones detector

Given data (xi,yi)`(x_i, y_i)`

  1. Define a set of Haar-like features(Fig 6)

    Haar-features are sensitive to directionality of patterns

    Haar features
  2. Find a weak classifier with the lowest error across the dataset

  3. Save the weak classifier and update the priority of the data samples

  4. Repeat Steps 2-3 N`N` times

Final classifier is the linear combination of all weak learners.
Histogram of Oriented Gradients

Average gradient image over training samples `\rightarrow` gradients provide shape information.

HOG descriptor `\rightarrow` Histogram of oriented gradients.

Compute gradients in dense grids, compute gradients and create a histogram based on gradient direction.

  1. Choose your training set of images that contain the object you want to detect.

  2. Choose a set of images that do NOT contain that object.

  3. Extract HOG features from both sets.

  4. Train an SVM classifier on the two sets to detect whether a feature vector represents the object of interest or not(0/1 classification)

Deformable Part Model

Many objects are not rigid, so we could use a Bottom-up approach:

  1. detect body parts

  2. detect "person" if the body parts are in correct arrangement

Note: The amount of work for each RoI may grow significantly.

Two-stage detectors

Two-stage object detectors split the object detection task to two parts(Fig 7):

  1. Region Proposal: The network first identifies potential regions in the image that may contain objects.

  2. Refinement: These regions are further analyzed to classify the objects and refine their bounding boxes.

2-stage detection

A generic, class-agnostic objectness measure: object proposals or regions of interest(RoI)

Object proposal methods:

  1. Selective search

    - Using class-agnostic segmentation at multiple scales.

  2. Edge boxes

    - Bounding boxes that wholly enclose detected contours.

Non-Maximum Suppression(NSM)

Method to keep only the best proposals(See algorithm [alg:nms]).

Bnms`B_{\text{nms}} \gets \emptyset` discardFalse`\text{discard} \gets \text{False}` discardTrue`\text{discard} \gets \text{True}` BnmsBnmsbi`B_{\text{nms}} \gets B_{\text{nms}} \cup b_i` Bnms`B_{\text{nms}}`

Region overlap

We measure region overlap with the Intersection over Union(IoU) or Jaccard Index:

J(A,B)=ABABJ(A, B) = \frac{|A \cap B|}{A \cup B}

The threshold is the decision boundary used to determine whether a detected object or prediction should be considered valid. For example, in object detection:

Object detection with deep networks

Overfeat

Explores three well-known vision tasks using a single framework.

Training convolution on all tasks boosts up the accuracy, see Fig 8.

Overfeat

Apply Non-Max Suppression to combine the predictions and windows, see Fig 9.

NMS

Improved detection accuracy is largely due to feature representation learned with deep nets.

But cons:

The complexity of a sliding window is O(ND)`O(ND)`, where N`N` is the number of all the windows, and D`D` represents the amount of work needed to perform detection and classification for one window. The complexity of the "pre-filtered" method is O(Nd+nD)`O(Nd + nD)`, where d`d` is the amount of work needed for filtering each window, and n`n` is the number of windows left after being filtered.

"Pre-filtering" method pays off when:

O(ND)>O(Nd+nD)\begin{equation} O(ND) > O(Nd + nD) \end{equation}

Assume the constant factors are comparable/negligible:

nN+dD<1\begin{equation} \frac{n}{N} + \frac{d}{D} < 1 \end{equation}

where nN`\frac{n}{N}` is the Region-of-Interest(RoI) ratio, and dD`\frac{d}{D}` is the efficiency of RoI generator. In practice, there is a delicate balance between n`n` and d`d`: Reducing d`d` and increasing n`n`.

Object proposals(Pre-filtering) and pooling for two-stage detection

Heuristic-based object proposal methods:

  1. Selective search

    - Using class-agnostic segmentation at multiple scales.

  2. Edge boxes

    - Bounding boxes that wholly detected contours

R-CNN family(Regions with CNN features)

R-CNN

Training scheme:

  1. Pre-train the CNN for image classification(ImageNet)

  2. Finetune the CNN on the number of classes the detector is aiming to classify

  3. Train a linear SVM classifier to classify image regions - one linear SVM per class

  4. Train the bounding box regressor

Pros:

Cons:

Problems:

  1. Input image has a fixed size

    FC layer prevents us from dealing with any image size.

  2. TBD

SPP-Net

Fast R-CNN

Fast R-CNN = R-CNN + RoI Pooling(Single layer SPP-Net)

Fast R-CNN

RoI Pooling RoI Pooling is a technique introduced in the Fast R-CNN framework to efficiently extract fixed-size feature representations for arbitrary regions(bounding boxes) in a feature map. It address the problem of feeding region proposals of various sizes into a neural network in a way that:

1. Feature extraction for the entire image

Instead of cropping individual region proposals from the original image and passing them each though a ConvNet(as done in older approaches like R-CNN), we:

This means the network only processes the image once, which is much more efficient.

2. Mapping region proposals to the feature map

You have a set of region proposals(bounding boxes) in the original image coordinates-these come from an external region proposal method(e.g., selective search) or from a region proposal network(in Faster R-CNN).

Each region proposal is then mapped onto the feature map coordinates:

- Because the CNN reduces spatial resolution(due to pooling layers or strided convolutions), the coordinates of each region of the original image need to be scaled to align with the coordinate system of the smaller feature map.

3. Dividing each RoI into Sub-regions

To handle different RoI sizes but produce a fixed-size output(for example 7X7 output grid in Fast R-CNN):

4. Pooling operation

Within each of these sub-regions(bins):

By doing this for each subregion in the grid, you transform the entire RoI into a fixed-size feature map(e.g., 7X7), regardless of the original size of the bounding box.

5. Feeding pooled features to classifier layers

Note that you have a fixed spatial dimension(e.g., 7X7), you can:

Result of Fast R-CNN

Fast R-CNN results

Faster R-CNN

Faster R-CNN

Faster R-CNN = Fast R-CNN + Region proposal network

Region Proposal Network: We fix the number of proposals by using a set of n=9`n = 9` anchors per location.

9 anchors=3 scales ×3 aspect ratios\begin{equation} 9 \text{ anchors} = 3 \text{ scales } \times 3 \text{ aspect ratios} \end{equation}

A 256-d descriptor characterizes every location.

For feature map location, we get a set of anchor corrections and classification into object/non-object.

RPN: Training

Classification ground truth: We compute p`p^*` which indicates how much an anchor overlaps with the ground truth bounding boxes:

p=1 if IoU>0.7\begin{equation} p^* = 1 \text{ if } IoU > 0.7 \end{equation} p8=0 if IoU<0.3\begin{equation} p^8 = 0 \text{ if } IoU < 0.3 \end{equation}

Faster R-CNN training:

- First implementation, training of RPN separate from the rest.

- Now we can train jointly!

- Four losses:

  1. RPN classification(object/non-object)

  2. RPN regression(anchor – proposal)

  3. Fast R-CNN classification(type of object)

  4. Fast R-CNN regression(proposal – box)

Faster R-CNN Performance

Feature Pyramid Networks

CNNs are not scale-invariant.

Straightforward implementation:

Integration with RPN for object detection:

Pros:

Cons: Increased model complexity.

Single-stage object detection

YOLO

YOLO

Inference time:

YOLO Inference

It is more efficient than Faster R-CNN, but less accurate.

SSD

Pros:

Cons:

Single shot detection

RetinaNet

Problems with one-stage detectors

Two-stage detectors:

Problems with one-stage detectors:

Class imbalance:

- Idea: balance the positives/negatives in the loss function.

- Recall cross-entropy loss:

CE loss

Focal loss

Replace CE with focal loss(FL):

Focal loss

RetinaNet

Retina

Spatial Transformers

Features of Spatial pooling:

Solution: Spatial Transformers.

Equivariance is needed:

f(A(x))=A(f(x))\begin{equation} f(A(x)) = A(f(x)) \end{equation}

Cons:

- Difficulty of training for generalizing to more challenging scenes.

Detection evaluation

Precision: How many zebras you found are actually zebras?

Precision =TPTP+FP\begin{equation} \text{Precision } = \frac{TP}{TP + FP} \end{equation}

Recall: How many actual zebras in the image/dataset you could find?

Recall =TPTP+FN\begin{equation} \text{Recall } = \frac{TP}{TP + FN} \end{equation}

What is a true positive?

Resolving conflicts

All-in-one metric: Average precision(AP)

There is often a trade-off between Precision and Recall.

AP

AP = the area under the Precision-Recall curve(AUC)

Computing average precision

  1. Sort the predicted boxes by confidence score

  2. For each prediction find the associated ground truth

    - The ground truth should be unassigned and pass IoU threshold

  3. Compute cumulative TP and FP:

    - TP: [1, 0, 1] FP: [0, 1, 0] `\rightarrow` cTP = [1, 1, 2], cFP = [0, 1, 1]

  4. Compute precision and recall(#GT = 3 `\rightarrow` TP + FN + 3)

    - Precision: [1/1,1/2,2/3]`[1/1, 1/2, 2/3]`; Recall(#GT=3): [1/3,1/3,2/3]`[1/3, 1/3, 2/3]`

  5. Plot the precision-recall curve and the area beneath(num. integration).

mAP is the average over object categories

Single object tracking

Problem statement: Given observations at t`t`, and find their correspondence at t+1(n0)`t+1(n \neq 0)`

Problem state

Challenges:

A simple solution: Tracking by detection.

- Detect the object in every frame.

Problem: data association.

Bayesian tracking

Bayesian example

Goal: Estimate car position at each time instant (say, the white car).

Observation: Image sequence and known background.

Observations: Image

System state: Car position(x,y`x, y`)

Bayesian probabilities

Notations:

Goal:

Estimate posterior probability p(xkZk)`p(x_k|Z_k)`.

How? Recursion:

p(xk1Zk1)p(xkZk)\begin{equation} p(x_{k-1}|Z_{k-1}) \rightarrow p(x_k|Z_k) \end{equation}

Hidden Markov model

Two assumptions of HMM:

  1. p(zkxk,Zk1)=p(zkxk)\begin{equation} p(z_k|x_k, Z_{k-1}) = p(z_k|x_k) \end{equation}
  2. p(xkxk1,Zk1)=p(xkxk1)\begin{equation} p(x_k|x_{k-1}, Z_{k-1}) = p(x_k|x_{k-1}) \end{equation}

Recursive Estimation

p(xkZk)=p(xkzk,Zk1)(Applying Bayes’ theorem)p(zkxk,Zk1)p(xkZk1)p(zkxk)p(xkZk1)(Assumption: p(zkxk,Zk1)=p(zkxk))p(zkxk)p(xk,xk1Zk1)dxk1(Marginalization)p(zkxk)p(xkxk1,Zk1)p(xk1Zk1)dxk1p(zkxk)p(xkxk1)p(xk1Zk1)dxk1(Assumption: p(xkxk1,Zk1)=p(xkxk1))\begin{align*} p(x_k | \mathbf{Z}_k) &= p(x_k | z_k, \mathbf{Z}_{k-1}) \quad \text{(Applying Bayes' theorem)} \\ &\propto p(z_k | x_k, \mathbf{Z}_{k-1}) \cdot p(x_k | \mathbf{Z}_{k-1}) \\ &\propto p(z_k | x_k) \cdot p(x_k | \mathbf{Z}_{k-1}) \quad \text{(Assumption: $p(z_k | x_k, \mathbf{Z}_{k-1}) = p(z_k | x_k)$)} \\ &\propto p(z_k | x_k) \cdot \int p(x_k, x_{k-1} | \mathbf{Z}_{k-1}) \, dx_{k-1} \quad \text{(Marginalization)} \\ &\propto p(z_k | x_k) \cdot \int p(x_k | x_{k-1}, \mathbf{Z}_{k-1}) \cdot p(x_{k-1} | \mathbf{Z}_{k-1}) \, dx_{k-1} \\ &\propto p(z_k | x_k) \cdot \int p(x_k | x_{k-1}) \cdot p(x_{k-1} | \mathbf{Z}_{k-1}) \, dx_{k-1} \quad \text{(Assumption: $p(x_k | x_{k-1}, \mathbf{Z}_{k-1}) = p(x_k | x_{k-1})$)} \end{align*}

Key Concepts:

Bayesian formulation:

p(xkZk)=kp(zk,xk)p(xkxk1)p(xk1Zk1)dxk1\begin{equation} p(x_k|Z_k) = k \cdot p(z_k, x_k) \cdot \int{p(x_k|x_{k-1}) \cdot p(x_{k-1}|Z_{k-1})dx_{k-1}} \end{equation}

Estimators:

Assume the posterior probability p(xkZk)`p(x_k|Z_k)` is known:

Deep networks:

Online vs. Offline tracking

Online tracking:

- "Given observations so far, estimate the current state."

Offline tracking:

- "Given all observations, estimate any state."

An online tracking model can be used for offline tracking too. Our recursive Bayesian model will still work.

GOTURN

GOTURN

Temporal prior:

p(xkxk1)=δ(xkxk1)\begin{equation} p(x_k|x_{k-1}) = \delta(x_k - x_{k-1}) \end{equation}

where δ`\delta` is Dirac delta function.

Temporal prior

Advantages:

Disadvantages:

Online adaptation

Problem: The initial object appearance may change drastically,

- e.g., due to occlusion, pose change, etc.

Idea: adapt the appearance model on the (unlabeled) test sequence.

MDNet

Tracking with online adaptation

MDNet

Multi-object tracking

Challenges:

DL’s role in MOT:

  1. Tracking initialization(e.g., using a detector.)

    - Deep learning `\rightarrow` more accurate detectors.

  2. Prediction of the next position(motion model).

    - We can learn a motion model from data.

  3. Matching predictions with detections:

    - Learning robust metrics(robust to pose changes, partial occlusions, etc.).

    - Matching can be embedded into the model.

Approach 1

  1. Track initialization(e.g., using a detector).

  2. Prediction of the next position(motion model).

    - Classic: Kalman filter(e.g., state transition model.)

    - DL(data driven): LSTM/GRU.

  3. Matching predictions with detections(appearance model).

    - In general: reduction to the assignment problem.

    - Bipartite matching.

Typical models of dynamics

Bipartite matching

Bipartite matching
  1. Define distances between boxes(e.g., 1 - IoU).

    - We obtain N×N`N \times N` matrix.

  2. Solve the assignment.

    - Using Hungarian algorithm O(N3)`O(N^3)`

  3. The bipartite matching solution corresponds to the minimum total cost.

What happens if one detection is missing?

  1. Add a pseudo detection with a fixed cost(e.g., 0).

  2. Run the Hungarian method as before.

  3. Discard the assignment to the pseudo node.

What happens if no prediction is suitable?

Pseudo Node

Approach 2: Tracktor

  1. Detect objects in frame k1`k-1`(e.g., using an object detector)

  2. Initialize in the same location in frame j.

  3. Refine predictions with regression.

  4. Run object detection again to find new tracks.

Tractor

Advantages:

  1. We can reuse well-engineered object detectors:

    - The same architecture of regression and classification heads.

  2. Work well even if trained on still images:

    - The regression head is agnostic to the object ID and category.

Disadvantages:

  1. No motion model:

    - problems due to large motions(camera, objects) or low frame rate.

  2. Confusion in crowded spaces:

    - Since there is no notion of "identity" in the model.

  3. Temporary occlusions(the track is "killed"):

    - Generally applies to all online trackers.

    - Partial remedy: Long-term memory of tracks(using object embeddings).

Problem of using IoU as metric:

The implicit assumption of small motion.

We need a more robust metric.

Metric learning: For re-identification(Re-ID)

Metric spaces

Definition:

A set X`X`(e.g., containing images) is said to be a metric space if with any two points p`p` and q`q` of X`X` there is associated a real number d(p,q)`d(p, q)`, called the distance from p`p` to q`q`, such that

Any function with these properties is called a distance function, or a metric.

Let’s reformulate:

d(p,q)=dω(fθ(p),fθ(q))\begin{equation} d(p, q) = d_{\omega}(f_{\theta}(p), f_{\theta}(q)) \end{equation}

How do we train a network to learn such a feature representation?

Feature Representation

Metric learning method: add negative pairs.

Metric learning

- Minimize the distance between positive pairs; maximize it otherwise.

Our goal: d(A,B;θ)<d(A,C;θ)`d(A, B; \theta) < d(A, C; \theta)`.

The loss:

θ:=argminθEA,BS+[dθ(A,B)]EB,CS[dθ(B,C)]\begin{equation} \theta^* := \arg \min_{\theta}E_{A, B \in S^+}[d_{\theta}(A, B)] - E_{B, C \in S^-}[d_{\theta}(B, C)] \end{equation}

where S+`S^+` and S`S^-` are sets of positive and negative image pairs.

  1. Hinge loss:

    L(A,B)=yf(A)f(B)2+(1y)max(0,m2f(A)f(B)2)\begin{equation} L(A, B) = y^*||f(A) - f(B)||^2 + (1-y^*)\max{(0, m^2 - ||f(A) - f(B)||^2)} \end{equation}

    - where y`y^*` is 1 if (A, B) is a positive pair, and 0 otherwise.

    - hinge loss for negative pairs with margin m`m`

  2. Triplet loss:

    L(A,B,C)=max(0,f(A)f(B)2f(A)f(C)2+m)\begin{equation} L(A, B, C) = \max(0, ||f(A) - f(B)||^2 - ||f(A) - f(C)||^2 + m) \end{equation}
    Triplet loss
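A minimal numpy sketch of the hinge (contrastive) and triplet losses defined above; the embeddings `fA`, `fB`, `fC` are placeholders standing in for the outputs of some embedding network `f`.

```python
import numpy as np

def hinge_pair_loss(fa, fb, y, m=1.0):
    """Contrastive (hinge) loss for a pair: y=1 for a positive pair, y=0 otherwise."""
    d2 = np.sum((fa - fb) ** 2)
    return y * d2 + (1 - y) * max(0.0, m ** 2 - d2)

def triplet_loss(fa, fb, fc, m=1.0):
    """Triplet loss: anchor A, positive B, negative C, margin m."""
    d_pos = np.sum((fa - fb) ** 2)
    d_neg = np.sum((fa - fc) ** 2)
    return max(0.0, d_pos - d_neg + m)

# Hypothetical embeddings f(A), f(B), f(C).
fA, fB, fC = np.array([0.1, 0.9]), np.array([0.2, 0.8]), np.array([0.9, 0.1])
print(hinge_pair_loss(fA, fB, y=1), triplet_loss(fA, fB, fC))
```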

Metric learning for tracking

  1. Train an embedding network on triplets of data:

    - positive pairs: same person at different timesteps;

    - negative pairs: different people.

  2. Use the network to compute the similarity score for matching.

Summary of metric learning

Online vs. Offline tracking

Online tracking:

- "Given observations so far, estimate the current state."

Offline tracking:

- "Given all observations, estimate any state."

Graph based MOT

Tracking with network flow

Minimum-cost maximum-flow problem:

Determine the maximum flow with a minimum cost.

MOT

Minimizing the cost:

\begin{equation} f^* = \arg \min_f \sum_{i, j}{C(i, j)f(i, j)} \end{equation}

where `f` is the flow (corresponding to a set of disjoint trajectories), `C(i, j)` is the cost, and `f(i, j) \in \{0, 1\}` is the flow indicator on edge `(i, j)`.

To incorporate detection confidence, we split the node in two.

Network flow

where Cdet`C_{det}` indicates the detection confidence, and Ct`C_t` indicates the transition cost.

Problem with occlusions, such as:

Solution: Connect all nodes(detections) to entrance/exit nodes:

Solution to occlusion

And the graph subjects to flow conservation:

Flow conservation

MAP formulation

- Our solution is a set of trajectories `\mathcal{T}^*`

T=argmaxTP(TX)\begin{equation} \mathcal{T}^* = \arg\max_{\mathcal{T}} P(\mathcal{T} \mid \mathcal{X}) \end{equation} =argmaxTP(XT)P(T)\begin{equation} = \arg\max_{\mathcal{T}} P(\mathcal{X} \mid \mathcal{T}) P(\mathcal{T}) \end{equation} =argmaxTiP(xiT)P(T)\begin{equation} = \arg\max_{\mathcal{T}} \prod_i P(x_i \mid \mathcal{T}) P(\mathcal{T}) \end{equation} =argmaxTiP(xiT)TiTP(Ti)\begin{equation} = \arg\max_{\mathcal{T}} \prod_i P(x_i \mid \mathcal{T}) \prod_{T_i \in \mathcal{T}} P(T_i) \end{equation}

Bayes rule

Conditional independence of observations

Independence of trajectories

MAP formulation:

\begin{equation} \arg \max_{\mathcal{T}}\prod_{i}{P(x_i|\mathcal{T})}\prod_{T_i \in \mathcal{T}}{P(T_i)} \end{equation} \begin{equation} \arg \min_{\mathcal{T}} - \sum_i\log{P(x_i|\mathcal{T})} - \sum_{T_i \in \mathcal{T}}\log{P(T_i)} \rightarrow \text{ log-space for optimization} \end{equation}

Prior

TiTlogP(Ti)\begin{equation} \sum_{T_i \in \mathcal{T}} \log P(T_i) \end{equation}

Trajectory model: count the entrance, exit and transition costs

Ti:=(x0,x1,...,xn)\begin{equation} T_i := (\mathbf{x}_0, \mathbf{x}_1, ..., \mathbf{x}_n) \end{equation} P(Ti)=Pin(x0)j=1,nPt(xjxj1)Pout(xn)\begin{equation} P(T_i) = P_{\text{in}}(\mathbf{x}_0) \prod_{j=1,n} P_t(\mathbf{x}_j \mid \mathbf{x}_{j-1}) P_{\text{out}}(\mathbf{x}_n) \end{equation} logP(Ti)=logPin(x0)j=1,nlogPt(xjxj1)logPout(xn)\begin{equation} -\log P(T_i) = -\log P_{\text{in}}(\mathbf{x}_0) - \sum_{j=1,n} \log P_t(\mathbf{x}_j \mid \mathbf{x}_{j-1}) - \log P_{\text{out}}(\mathbf{x}_n) \end{equation}

Entrance Cost: logPin(x0)=fin(x0)Cin(x0)`-\log P_{\text{in}}(\mathbf{x}_0) = f_{\text{in}}(\mathbf{x}_0)C_{\text{in}}(\mathbf{x}_0)`

Transition Cost: `-\log P_t(\mathbf{x}_j \mid \mathbf{x}_{j-1}) = f_t(\mathbf{x}_j, \mathbf{x}_{j-1})C_t(\mathbf{x}_j, \mathbf{x}_{j-1})`

Exit Cost: logPout(xn)=fout(xn)Cout(xn)`-\log P_{\text{out}}(\mathbf{x}_n) = f_{\text{out}}(\mathbf{x}_n)C_{\text{out}}(\mathbf{x}_n)`

Likelihood:

ilogP(xiT)\begin{equation} - \sum_i \log P(\mathbf{x}_i \mid \mathcal{T}) \end{equation}

We can use Bernoulli distribution:

P(xiT):={γi,if TjT,xiTj1γi,otherwise\begin{equation} P(\mathbf{x}_i \mid \mathcal{T}) := \begin{cases} \gamma_i, & \text{if } \exists T_j \in \mathcal{T}, \mathbf{x}_i \in T_j \\ 1 - \gamma_i, & \text{otherwise} \end{cases} \end{equation}

γi`\gamma_i` denotes prediction confidence (e.g., provided by the detector)

logP(xiT)=f(xi)logγi(1f(xi))log(1γi)\begin{equation} -\log P(\mathbf{x}_i \mid \mathcal{T}) = - f(\mathbf{x}_i) \log \gamma_i - (1 - f(\mathbf{x}_i)) \log (1 - \gamma_i) \end{equation} =f(xi)log1γiγilog(1γi)\begin{equation} = f(\mathbf{x}_i) \log \frac{1 - \gamma_i}{\gamma_i} - \log (1 - \gamma_i) \end{equation}

Cdet(xi)=log1γiγi`C_{\text{det}}(\mathbf{x}_i) = \log \frac{1 - \gamma_i}{\gamma_i}`

log(1γi)`\log (1 - \gamma_i)` can be ignored in optimization

Optimization

(Cdet,Cin,Cout,Ct`C_{\text{det}}, C_{\text{in}}, C_{\text{out}}, C_t`) are estimated from data. Then:
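As one possible illustration, the resulting min-cost flow problem can be solved with an off-the-shelf solver. The sketch below uses `networkx.min_cost_flow` on a tiny two-frame example; the detections, the entrance/exit/transition costs, the fixed number of trajectories `K`, and the integer scaling of costs (networkx expects integer weights) are all assumptions for illustration, not the lecture's reference implementation. In practice `K` is not known and is searched over.

```python
import math
import networkx as nx

# Hypothetical two-frame example: detections (frame, id) with confidences gamma_i.
dets = {("f0", 0): 0.9, ("f0", 1): 0.8, ("f1", 0): 0.85, ("f1", 1): 0.7}
K = 2            # number of trajectories to extract (searched over in practice)
SCALE = 100      # networkx's min_cost_flow expects integer edge weights

def cost(c):
    return int(round(c * SCALE))

G = nx.DiGraph()
G.add_node("s", demand=-K)   # source pushes K units of flow
G.add_node("t", demand=K)    # sink absorbs K units of flow

for d, gamma in dets.items():
    # Split each detection into two nodes; the inner edge carries C_det.
    G.add_edge(("u", d), ("v", d),
               weight=cost(math.log((1 - gamma) / gamma)), capacity=1)
    G.add_edge("s", ("u", d), weight=cost(1.0), capacity=1)   # entrance cost C_in
    G.add_edge(("v", d), "t", weight=cost(1.0), capacity=1)   # exit cost C_out

# Transition edges from frame f0 to frame f1 with a hypothetical fixed cost C_t.
for a in dets:
    for b in dets:
        if a[0] == "f0" and b[0] == "f1":
            G.add_edge(("v", a), ("u", b), weight=cost(0.5), capacity=1)

flow = nx.min_cost_flow(G)   # trajectories = chains of edges carrying flow 1
print([(u, v) for u in flow for v, f in flow[u].items() if f > 0])
```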

Summary

Open questions:

Tracking with Message Passing Network

End-to-end learning?

Setup

Deep learning on graphs

Key challenges:

Message Passing Network

Message passing

  1. Initial graph

    • Graph: G=(V,E)`G = (V, E)`

    • Initial embeddings:

      - Node embedding: `h_i^{(0)}, i \in V`

      - Edge embedding: hi,j(0),(i,j)E`h_{i, j}^{(0)}, (i, j) \in E`

    • Embeddings after `l` steps: `h_i^{(l)}, i \in V` and `h_{(i, j)}^{(l)}, (i, j) \in E`

  2. "Node-to-edge" update

    Node-to-edge update
  3. "Edge-to-node" update

    1. Use the updated edge embeddings to update nodes:


    2. After a round of edge updates, each edge embedding contains information about its pair of incident nodes.

    3. By analogy: `h_i^{(l)} = \mathcal{N}_v([h_i^{(l-1)}, h_{(i, j)}^{(l)}])`

    4. In general, we may have an arbitrary number of neighbors ("degree", or "valency").

    5. Define a permutation-invariant aggregation function:

      \begin{equation} \Phi^{(l)}(i) := \Phi(\{h_{(i, j)}^{(l)}\}_{j \in \mathcal{N}_e(i)}) \end{equation}

      where the input is a set of embeddings from incident edges.

    6. Re-define the edge-to-node updates for a general graph:


      \begin{equation} h_i^{(l)} = \mathcal{N}_v(h_i^{(l-1)}, \Phi^{(l)}(i)) \end{equation}
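A minimal PyTorch sketch of one round of the "node-to-edge" and "edge-to-node" updates above; the embedding size, the two small MLPs, the toy graph, and the sum used as the aggregation `\Phi` are illustrative assumptions.

```python
import torch
import torch.nn as nn

D = 32                                                 # embedding size (illustrative)
N_e = nn.Sequential(nn.Linear(3 * D, D), nn.ReLU())    # edge update MLP
N_v = nn.Sequential(nn.Linear(2 * D, D), nn.ReLU())    # node update MLP

# Hypothetical graph: 3 nodes, edges stored as (i, j) pairs.
edges = [(0, 1), (1, 2), (0, 2)]
h_node = {i: torch.randn(D) for i in range(3)}         # h_i^{(l-1)}
h_edge = {e: torch.randn(D) for e in edges}            # h_{(i,j)}^{(l-1)}

# "Node-to-edge" update: each edge sees its two incident node embeddings.
h_edge = {(i, j): N_e(torch.cat([h_node[i], h_node[j], h_edge[(i, j)]]))
          for (i, j) in edges}

# "Edge-to-node" update: aggregate incident edges with a permutation-invariant sum.
def phi(i):
    incident = [h_edge[e] for e in edges if i in e]
    return torch.stack(incident).sum(dim=0)

h_node = {i: N_v(torch.cat([h_node[i], phi(i)])) for i in h_node}
print(h_node[0].shape)   # torch.Size([32])
```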

Remarks:

MOT with MPN

MOT with MPN
  1. Input

  2. Graph construction + feature encoding

    - Encode appearance and scene geometry cues into node and edge embeddings.

  3. Neural message passing

    - Propagate cues across the entire graph with neural message passing.

  4. Edge classification

    - Learn to directly predict solutions of the Min-cost flow problem by classifying edge embeddings.

  5. Output

Feature encoding

Appearance and geometry encodings:

Geometry encodings
  1. Relative Box Position

    (2(yjyi)hi+hj,2(xjxi)wi+wj)\begin{equation} \left( \frac{2(y_j - y_i)}{h_i + h_j}, \frac{2(x_j - x_i)}{w_i + w_j} \right) \end{equation}
    • This encodes the normalized relative position of two bounding boxes.

    • The difference in the y-coordinate is divided by the sum of their heights.

    • The difference in the x-coordinate is divided by the sum of their widths.

    • The factor 2 ensures the values are in a normalized range.

  2. Relative Box Size

    (loghihj,logwiwj)\begin{equation} \left( \log \frac{h_i}{h_j}, \log \frac{w_i}{w_j} \right) \end{equation}
    • These terms encode the relative scale change between the two bounding boxes.

    • Using a logarithm makes it more robust to size differences.

  3. Time Difference

    tjti\begin{equation} t_j - t_i \end{equation}
    • This represents the time gap between the two detections.

    • It helps in determining whether two bounding boxes belong to the same object across frames.

Note: the CNN and MLP weights are shared across all nodes and edges.
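A short sketch of the geometry part of an edge embedding, computed exactly from the three formulas above; the assumption here is that boxes are given as `(x, y, w, h)` with frame indices `t_i`, `t_j`.

```python
import math

def edge_geometry_features(box_i, box_j, t_i, t_j):
    """Geometry features for an edge between boxes (x, y, w, h) at times t_i, t_j."""
    x_i, y_i, w_i, h_i = box_i
    x_j, y_j, w_j, h_j = box_j
    rel_pos = (2 * (y_j - y_i) / (h_i + h_j),   # normalized relative position
               2 * (x_j - x_i) / (w_i + w_j))
    rel_size = (math.log(h_i / h_j),            # relative scale change
                math.log(w_i / w_j))
    return [*rel_pos, *rel_size, t_j - t_i]     # plus the time difference

# Hypothetical boxes observed in frames 3 and 5.
print(edge_geometry_features((100, 50, 40, 80), (110, 55, 42, 84), 3, 5))
```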

Contrast:

Temporal causality

Flow conservation at a node

Time-aware message passing:

Time-aware message passing

Classifying edges

Obtaining final solutions

Summary

Evaluation of MOT

Compute a set of measures per frame:

Metrics:

Segmentation

Segmentation

Flavors of image segmentation

K-means(clustering)

  1. Initialize(randomly) K cluster centers

  2. Assign points to clusters

    - using a distance metric

  3. Update centers by averaging the points in the cluster

  4. Repeat 2 and 3 until convergence.
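A minimal numpy sketch of steps 1-4 applied to per-pixel color features; the random "image", the number of clusters, and the fixed iteration count are placeholders.

```python
import numpy as np

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means on an (N, D) array of feature vectors."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]  # step 1
    for _ in range(iters):
        # Step 2: assign each point to the nearest center (Euclidean distance).
        dists = np.linalg.norm(points[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        # Step 3: update each center as the mean of its assigned points
        # (empty clusters keep their previous center for simplicity).
        centers = np.array([points[labels == c].mean(axis=0)
                            if np.any(labels == c) else centers[c]
                            for c in range(k)])
    return labels, centers

# Hypothetical "image": per-pixel RGB features flattened to (H*W, 3).
image = np.random.rand(32, 32, 3)
labels, _ = kmeans(image.reshape(-1, 3), k=4)
segmentation = labels.reshape(32, 32)     # cluster index per pixel
```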

Problem: The Euclidean distance may be insufficient. Instead, we want to cluster based on the notion of "connectedness".

Spectral clustering

Construct a Similarity Graph:

Compute the Graph Laplacian:

Optimization problem:

The Laplacian quadratic form measures how much variation exists between connected nodes.

xTLx=12i,jAi,j(xixj)2\begin{equation} x^TLx = \frac{1}{2}\sum_{i, j}A_{i,j}(x_i - x_j)^2 \end{equation}

where

Intuitively, if

xTLx=12i,jAi,j(xixj)20\begin{equation} x^TLx = \frac{1}{2}\sum_{i, j}A_{i,j}(x_i - x_j)^2 \approx 0 \end{equation}

then,

The solution `x^* = \arg \min_x x^TLx` (taken orthogonal to the trivial constant eigenvector) is the eigenvector corresponding to the second smallest eigenvalue of `L`.
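A minimal numpy sketch of this procedure: build an affinity matrix, form the unnormalized Laplacian, and split the points by the sign of the eigenvector of the second smallest eigenvalue (the Fiedler vector). The toy affinity matrix below is an assumption.

```python
import numpy as np

# Toy affinity matrix A for 6 points forming two weakly connected groups.
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    A[i, j] = A[j, i] = 1.0
A[2, 3] = A[3, 2] = 0.05          # weak link between the two groups

D = np.diag(A.sum(axis=1))        # degree matrix
L = D - A                         # unnormalized graph Laplacian

eigvals, eigvecs = np.linalg.eigh(L)     # eigenvalues in ascending order
fiedler = eigvecs[:, 1]                  # eigenvector of the 2nd smallest eigenvalue
clusters = (fiedler > 0).astype(int)     # threshold at zero to cut the graph
print(clusters)                          # e.g., [0 0 0 1 1 1] (signs may flip)
```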

Special case:

  1. Zero eigenvalue

    \begin{equation} x^TLx = \frac{1}{2}\sum_{i, j}A_{i,j}(x_i - x_j)^2 = 0, \quad A_{i,j} \geq 0 \text{ for all } i, j \end{equation}

    What can we say about the corresponding eigenvector?

  2. Connected components

    - Proposition: the multiplicity k`k` of eigenvalue 0 equals to the number of connected components, spanned by indicator vectors.

    - Example with two components (`n \times 2` indicator matrix):

    Example
Spectral clustering in practice

Normalized cut(Ncut)

Spectral clustering as a min cut as Fig 54

Spectral clustering as a min cut

Problem with min cuts: Favour unbalanced isolated clusters:

Poor min cut

Balanced cut

Ncut(A,B)=cut(A,B)(assoc1(A,V)+assoc1(B,V))\begin{equation} Ncut(A, B) = cut(A, B)(assoc^{-1}(A, V) + assoc^{-1}(B, V)) \end{equation}

where

Ncut

  1. Define a graph G:=(V,E)`G:= (V, E)`

    • V is the set of nodes representing pixels.

    • E defines similarities of two nodes.

  2. Solves a generalized eigenvalue problem: (DW)x=λDx`(D - W)x = \lambda Dx`

    where

    • d(i)=jwi,j`d(i) = \sum_{j}w_{i,j}`

    • (DW)`(D-W)` is the Laplacian matrix

    • equivalently D12(DW)D12x=λx`D^{-\frac{1}{2}}(D-W)D^{-\frac{1}{2}}x = \lambda x`

  3. Use the eigenvector with the 2nd smallest eigenvalue to cut the graph.

  4. Recurse if needed(i.e. subdivide the two node groups).

Energy-based model: Conditional random fields(CRFs)

Energy function:

E(x,y)=iϕ(xi,yi)+i,jψ(xi,xj)\begin{equation} E(x, y) = \sum_i\phi(x_i, y_i) + \sum_{i, j}\psi(x_i, x_j) \end{equation}

where

Conditional Random Fields

Boykov and Jolly (2001)

E(x,y)=iφ(xi,yi)+ijψ(xi,xj)E(x,y) = \sum_i \varphi(x_i, y_i) + \sum_{ij} \psi(x_i, x_j)

Variables

Unary term

Pairwise term

Optimization with graph cuts:

Max-flow min-cut theorem: The maximum value of an s-t flow is equal to the minimum capacity over all s-t cuts.

Grid structured random fields

Fully connected models:

Using fully connected CRFs:

Fully connected CRFs

Fully convolutional neural networks

Deep networks for classification:

Classification networks

To extend this to segmentation:

Extended networks for segmentation
  1. Remove GAP:

    • Increase number of parameters

    • Fixed input size only

    • No translation invariance

  2. Replace fully connected layer with convolution(1×1`1 \times 1` convolution)

    • Few parameters

    • Variable input size

    • Translation invariance

  3. Upsample the last layer output to the original resolution.

  4. Can be trained with pixel-wise cross-entropy with SGD.

1×1`1\times 1` convolution

\(1\times1\) conv
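A minimal PyTorch sketch of steps 2-3 above: replace the fully connected head with a `1 \times 1` convolution so the network accepts variable input sizes, then upsample the coarse score map to the input resolution. The tiny backbone and the class count are placeholders.

```python
import torch
import torch.nn as nn

num_classes = 21   # placeholder

# A tiny "backbone" standing in for a classification network without GAP/FC.
backbone = nn.Sequential(
    nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
)

# Fully convolutional head: a 1x1 convolution instead of a fully connected layer.
classifier = nn.Conv2d(128, num_classes, kernel_size=1)

x = torch.randn(1, 3, 96, 128)             # variable input size is fine
scores = classifier(backbone(x))           # coarse per-pixel class scores
logits = nn.functional.interpolate(        # upsample to the input resolution
    scores, size=x.shape[-2:], mode="bilinear", align_corners=False)
print(logits.shape)                        # torch.Size([1, 21, 96, 128])
```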

We may try to maintain the original image resolution in the encoder:

Output stride: input resolution/output resolution.

Decreasing the output stride improves segmentation accuracy.

The problems of keeping feature resolution high in all layers:

Dilated convolutions: To maintain the receptive field size.

Problem 2: Computational footprint(both memory and FLOPs):

Solution: Upsampling

Recall convolution(as matrix multiplication):

Conv as matrix multiplication

We can obtain the opposite effect(increase the output size) by:

X=WTX Broadcasting\begin{equation} X = W^TX' \rightarrow \text{ Broadcasting} \end{equation}

Transposed convolution:

Upsampling: Interpolation

Resize-convolution:
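Both upsampling strategies can be sketched in PyTorch as below; the channel counts and feature-map size are arbitrary. The resize-convolution variant (interpolate, then convolve) is often preferred to avoid the checkerboard artifacts of transposed convolutions.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 16, 16)    # a low-resolution feature map

# Option 1: transposed convolution (learned upsampling; stride 2 doubles H and W).
up_transposed = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)
print(up_transposed(x).shape)     # torch.Size([1, 32, 32, 32])

# Option 2: resize-convolution (interpolate, then a normal convolution).
up_resize = nn.Sequential(
    nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
    nn.Conv2d(64, 32, kernel_size=3, padding=1),
)
print(up_resize(x).shape)         # torch.Size([1, 32, 32, 32])
```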

U-Net

To mitigate information loss, we apply the first layers in the encoder via a skip connection.

U-Net

U-Net’s key unique features:

SegNet

SegNet

SegNet is a deep learning architecture primarily used for image segmentation tasks. It is similar to U-Net in that it uses an encoder-decoder structure, but it has some unique features that distinguish it.

Best practices for segmentation

Instance segmentation

Semantic segmentation:

  - Labels every pixel, including the background ("stuff": sky, grass, road).

  - Does not differentiate between pixels from objects (instances) of the same class.

Instance segmentation:

  - Does not label pixels from uncountable regions ("stuff"), e.g., "sky", "grass", "road".

  - Differentiates between pixels coming from instances of the same class.

Instance segmentation methods

Proposal based method: Mask R-CNN

Start with Faster R-CNN

Mask R-CNN

RoIPool vs. RoIAlign

RoIAlign:

RoIAlign
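A minimal sketch of RoIAlign using `torchvision.ops.roi_align`; the feature map, the single RoI, and the stride-16 `spatial_scale` (mapping image coordinates to feature-map coordinates) are hypothetical.

```python
import torch
from torchvision.ops import roi_align

features = torch.randn(1, 256, 50, 50)       # backbone feature map (stride 16)

# One RoI in image coordinates: (batch_index, x1, y1, x2, y2).
rois = torch.tensor([[0, 64.0, 32.0, 320.0, 256.0]])

pooled = roi_align(
    features, rois,
    output_size=(7, 7),        # fixed-size output for the detection heads
    spatial_scale=1.0 / 16,    # image coords -> feature-map coords
    sampling_ratio=2,          # bilinear samples per output bin (no quantization)
    aligned=True,
)
print(pooled.shape)            # torch.Size([1, 256, 7, 7])
```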

Problem of Mask R-CNN: Low mask resolution.

Mask R-CNN with PointRend

Remarks:

Proposal-free method

We obtain a semantic map using Fully convolution networks for semantic segmentation.

SOLOv2

Idea: Predict the kernels:

SOLOv2

Instance segmentation: Summary

Panoptic segmentation

Typical architecture

Challenges:

Two broad categories:

Panoptic FPN

Panoptic FPN

Remark:

Training with multiple loss terms("multi-task learning") can be challenging, as different loss terms may "compete" for desired feature representation.

Panoptic FPN: Summary

Panoptic FCN

Even simpler?

Panoptic FCN
Panoptic FCN

Panoptic evaluation

Video object segmentation

Goal: Generate accurate and temporally consistent pixel masks for objects in a video sequence.

Challenges:

What we need:

  1. Appearance model:

    • Assumption: constant appearance.

    • Input: 1 frame.

    • Output: segmentation mask.

  2. Motion model(maybe optional):

    • Assumption: smooth displacement; brightness constancy.

    • Input: 2 frames.

    • Output: motion(optical flow).

Semi-supervised(one-shot) VOS task formulation:

Motion-based VOS

Optical flow:

Minimize brightness difference:

Brightness difference
ESSD(u,v)=(x,y)R(I(x+u,y+v,t+1)I(x,y,t))2\begin{equation} E_{SSD}(u, v) = \sum_{(x, y) \in R}(I(x+u, y+v, t+1) - I(x, y, t))^2 \end{equation}

Lucas-Kanade:

[uv]=[RIx2RIxIyRIxIyRIy2]1[RIxItRIyIt]\begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} \sum_R I_x^2 & \sum_R I_x I_y \\ \sum_R I_x I_y & \sum_R I_y^2 \end{bmatrix}^{-1} \begin{bmatrix} -\sum_R I_x I_t \\ -\sum_R I_y I_t \end{bmatrix}

This is a classical flow technique - Lucas-Kanade method.

Joint formulation of segmentation and optical flow estimation:

VOS with optical flow

FlowNet: Architecture 1

FlowNet: Architecture 2 `\rightarrow` Siamese architecture

Siamese architecture

Correlation layer:

Motion-based VOS: Summary

Appearance-only VOS

Main idea:

One-shot VOS(OSVOS): Separate the training steps

OSVOS

Drifting problem:

OnAVOS: Online Adaptation

OnAVOS

MaskTrack

MaskTrack

Training inputs can be simulated.

Appearance-only VOS: Summary

Advantages of appearance-based models:

Disadvantages:

Metric-based approaches

Pixel-wise retrieval:

Transformers

CNNs: A few relevant facts

Universality: Developing components that work across all possible settings:

Self-attention: A hash table analogy

Consider a hash table:

Putting it all together:

Attention(Q,K,V)=softmax(QKTn)V\begin{equation} Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{n}})V \end{equation}

Scaling factor `\sqrt{n}` eases optimization, especially for large `n`.
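A minimal numpy sketch of scaled dot-product attention as defined above; the random queries, keys, and values are placeholders for the projected token embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(n)) V."""
    n = Q.shape[-1]                       # key/query dimension
    scores = Q @ K.T / np.sqrt(n)         # similarity of each query to each key
    return softmax(scores, axis=-1) @ V   # weighted average of the values

# Hypothetical sequence of 5 tokens with dimension 8.
Q, K, V = (np.random.randn(5, 8) for _ in range(3))
print(attention(Q, K, V).shape)           # (5, 8)
```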

Linear projection

Complexity

Multi-head attention

Normalization:

Further improvements:

Recap:

Transformers

Transformer-based encoder:

Encoder

Vision Transformers

An all-round transformer architecture:

Steps:

  1. Split an image into fixed-sized patches.

  2. Linearly embed each patch(a fully connected layer).

  3. Attach an extra "[class]" embedding(learnable).

  4. Add positional embeddings.

  5. Feed the sequence to the standard Transformer encoder.

  6. Predict the class with an MLP using the output [class] token.
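A minimal PyTorch sketch of steps 1-5; the patch size, embedding dimension, and the two-layer `nn.TransformerEncoder` standing in for the full encoder are illustrative assumptions (patchify and linear embedding are done jointly with a strided convolution, which is equivalent to a per-patch linear layer).

```python
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)
patch, dim = 16, 768
num_patches = (224 // patch) ** 2                       # 196 patches

# 1.-2. Split into patches and linearly embed each one.
to_patches = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
tokens = to_patches(img).flatten(2).transpose(1, 2)     # (1, 196, 768)

# 3.-4. Prepend a learnable [class] token and add positional embeddings.
cls_token = nn.Parameter(torch.zeros(1, 1, dim))
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
tokens = torch.cat([cls_token, tokens], dim=1) + pos_embed

# 5. Feed the sequence to a standard Transformer encoder.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True),
    num_layers=2)
out = encoder(tokens)
print(out[:, 0].shape)        # the output [class] token: torch.Size([1, 768])
```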


Vision Transformer

Experiment with ViT

Swin Transformers

Recall ViT

Swin Transformer:

A naive implementation will hurt context reasoning:

At any given level(except the last one), our context is not global anymore:

Naive Swin

Solution: Alternate the layout of local windows:

Mature Swin

Successively decrease the number of tokens using patch merging:

Swin Transformer

Results:

Summary:

Transformer for detection?

DETR

DETR

A closer look:

DETR

The loss function

\begin{equation} \mathcal{L}_{Hungarian}(y, \hat{y}) = \sum^N_{i=1}[-\log{\hat{p}_{\hat{\sigma}(i)}(c_i)} + \mathbf{1}_{\{c_i \neq \emptyset\}}\mathcal{L}_{box}(b_i, \hat{b}_{\hat{\sigma}(i)})] \end{equation}

where

The bounding box loss:

λiouLiou(bi,b^σ(i))+λL1bib^σ(i)1\begin{equation} \lambda_{iou}\mathcal{L}_{iou}(b_i, \hat{b}_{\sigma(i)}) + \lambda_{L1}||b_i - \hat{b}_{\sigma(i)}||_1 \end{equation}

DETR: Summary

MaskFormer

MaskFormer

A unified model for semantic and panoptic segmentation.

Recall: Panoptic FCN

Panoptic FCN

MaskFormer’s idea: Compute the kernels using learnable queries with a Transformer.

Works well, but has troubles with small objects/segments.

Mask2Former

Idea: Attend with self-attention at multiple scales of the feature hierarchy:

Mask2Former

Masked attention:

Mask2Former: Summary

Conclusions

Unsupervised

Learning without labels

Compact, yet descriptive representation?

Consider the loss:

L=I(X;Z)βI(Z;Y)\begin{equation} \mathcal{L} = I(X; Z) - \beta I(Z;Y) \end{equation}
Unsupervised loss

where

Recap: Mutual Information(MI)

I(X;Y)=xXyYP(X,Y)(x,y)logP(X,Y)(x,y)PX(x)PY(y)\begin{equation} I(X;Y) = \sum_{x\in \mathcal{X}}\sum_{y \in \mathcal{Y}}P_{(X, Y)}(x, y) \log \frac{P_{(X, Y)}(x, y)}{P_X(x)P_Y(y)} \end{equation}

After obtaining a compact and descriptive representation, we can apply it to different tasks, such as detection, classification, and segmentation, by adding task-specific shallow models (e.g., a linear projection).

Evaluating Self-supervised learning(SSL) models

Pretext tasks

Idea: Solve a different task with available(generated) supervision.

Caveats:

Pretext task: Rotation

Task: predicting image rotation.

Training process outline

  1. Quantize rotation angles(e.g., 0, 90, 180, 270 - 4 classes).

  2. Sample an image from the training dataset.

  3. Rotate the image by one of the pre-defined angles.

    - This defines our "ground-truth" label.

  4. Train the network to predict the correct rotation class.
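A minimal numpy sketch of steps 1-3, i.e. generating the rotated input and its "free" rotation label; the random stand-in image is a placeholder, and the resulting `(x, y)` pairs would be fed to a standard classifier in step 4.

```python
import numpy as np

ANGLES = [0, 90, 180, 270]   # 4 rotation classes

def make_rotation_sample(image, rng):
    """Rotate an (H, W, C) image by a random quarter turn; the class index is the label."""
    label = int(rng.integers(len(ANGLES)))
    rotated = np.rot90(image, k=label, axes=(0, 1)).copy()
    return rotated, label

rng = np.random.default_rng(0)
img = np.random.rand(32, 32, 3)          # a stand-in for a training image
x, y = make_rotation_sample(img, rng)    # train a classifier to predict y from x
print(x.shape, y)
```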

Pretext task: Rotation

Pretext task: Jigsaw puzzle

Jigsaw puzzle

Pretext task: Colorization

Predicting the original color of the images(in CIELAB color space):

Intuition: Proper colorization requires semantic understanding of the image.

Colorization

Nuances:

Contrastive learning

Recall metric learning

\begin{equation} \mathcal{L}(A, B, C) = \max{(0, ||f(A) - f(B)||^2 - ||f(A) - f(C)||^2 + m)} \end{equation}

Intuitive idea:

Metric learning

We need labels to define positive and negative pairs.

Contrastive learning is an extension of metric learning to unsupervised scenarios.

Idea:

Example:

Intuition
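One common instantiation of this idea is the InfoNCE (NT-Xent) objective used by SimCLR-style methods. The sketch below is an assumption-laden illustration, not specific to any method above: it assumes two augmented views per image, L2-normalized embeddings, and treats every other sample in the batch as a negative.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.5):
    """InfoNCE loss for a batch of embeddings of two augmented views (B, D)."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                     # (2B, D)
    sim = z @ z.T / temperature                        # pairwise similarities
    sim.fill_diagonal_(float("-inf"))                  # exclude self-similarity
    B = z1.shape[0]
    # The positive for sample i is its other augmented view, at index i +/- B.
    targets = torch.cat([torch.arange(B) + B, torch.arange(B)])
    return F.cross_entropy(sim, targets)

# Hypothetical embeddings of two views of 8 images.
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
print(info_nce(z1, z2))
```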

Deep frameworks for SSL

Non-contrastive learning

DINO

DINO

DINOv2

DINOv2

Masked Autoencoders

Unsupervised learning with Transformers and a reconstruction loss:

Masked autoencoders

Reconstructing HoG features, instead of pixel values:

Reconstructing HoG features

Remarks:

Downstream applications

What do DINO features encode?

DINO(and other SSL methods) provides semantic correspondence(almost) out-of-the-box.

Extracted features
Extracted features

We can also cluster "background" (stuff) areas, hence obtain semantic segmentation.

High level idea: Learn a lower dimensional embedding, S(f(i),f(j))`S(f(i), f(j))`, such that clustering in this space yields semantic masks.

Extracted features

Self-supervision in videos

Two groups of problems:

Segmentation from motion cues

Idea: If the object mask is correct, we cannot reconstruct the object-related optical flow. We train two networks:

Contrastive random walk

Forward-backward cycle consistency:

Conclusion

Semi-supervised learning

Training on labeled and unlabeled data.

Semi-supervised

General remarks:

Small print:

  1. Improvement is not always guaranteed.

    - It depends on the model, the technique used and the unlabeled data.

  2. Semi-supervised techniques are often complementary.

    - A combination of multiple techniques yields the best results (though it makes the framework more complex).

A practical perspective:

Semi-loss

Assumptions

Assumptions about semi-supervised learning:

Smoothness assumption

If two input points are close by, their labels should be the same.

Transitivity

Low-density assumption

The decision boundary should pass through a region with low density p(x)`p(x)`.

Manifold assumption

Remark:

Which assumptions to make depends on what we know about how our data distribution p(x)`p(x)` interacts with the class posterior p(yx)`p(y|x)`.

Two taxonomies

How unlabeled data is used:

Unsupervised pre-processing

- pre-training, clustering, etc.;

- Two stages:

Wrapper methods

Self-training

A single classifier trained jointly on labeled and self-labeled data from the unlabeled dataset.

OnAVOS:

OnAVOS

Online adaptation: Adapt model to appearance changes in every frame, not just the first frame.

Drawback: Can be slow.

Online Adaptation
OnAVOS with unlabeled dataset

Segment Anything (2)

Self training with pseudo labels

A general outline:

  1. Train a strong baseline on the labeled set;

    - e.g. with heavy data augmentation(crops, photometric noise).

  2. Predict pseudo-labels for the unlabeled set.

  3. Continue training the network on both labeled and pseudo-labeled samples.

  4. Repeat steps 2-3.
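A minimal PyTorch sketch of step 2, the pseudo-labeling pass: the confidence threshold, the toy classifier, and the random unlabeled batch are placeholders. The returned samples would then be mixed into the labeled training set (steps 3-4).

```python
import torch
import torch.nn.functional as F

def pseudo_label_step(model, unlabeled_batch, threshold=0.9):
    """Return confidently pseudo-labeled samples from an unlabeled batch."""
    model.eval()
    with torch.no_grad():
        probs = F.softmax(model(unlabeled_batch), dim=1)
        conf, labels = probs.max(dim=1)
    keep = conf > threshold                 # only keep high-confidence predictions
    return unlabeled_batch[keep], labels[keep]

# Hypothetical setup: a tiny classifier and a batch of unlabeled images.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
unlabeled = torch.randn(16, 3, 32, 32)
x_pseudo, y_pseudo = pseudo_label_step(model, unlabeled)
print(x_pseudo.shape, y_pseudo.shape)
```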

Intrinsically semi-supervised

Entropy minimization

Virtual adversarial network

Idea: Perturbations of the input should not change the label.

Learning from synthetic data

Labeled and unlabeled data may come from different distributions.

e.g. due to differences in the synthetic and real appearances.

Domain alignment

This translates into disjoint feature distribution of a model trained only on the labeled data:

Domain alignment

Solution: "Align" the two distributions.

Domain alignment means making two feature distributions indistinguishable.

We can use a GAN:

GAN

Consistency regularization

Consistent prediction across image transformations.

Semantic meaning does not change, though not guaranteed in a deep net `\rightarrow` Use it as a consistency loss.

Framework:

Araslanov and Roth

Momentum net

Test-time augmentation is applied online at training time:

Momentum net

Limited supervision

Limited supervision

Weakly supervised learning: training with coarse labels:

Segmentation with image-level labels

Idea: we can reuse the classifier weights to classify every pixel.

Summary