
2022.02.08 Review

---

https://arxiv.org/pdf/2104.02243v2.pdf

 

Background

  • Semantic Segmentation
    • classifying each pixel in an image from a predefined set of classes
  • Knowledge Distillation
    • process of transferring knowledge from one model to another model with a different structure
      • usually a model-compression method in which a small model is trained to mimic a pre-trained, larger model
      • more recently also studied for adapting knowledge across different domains
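The classic form of knowledge distillation matches the student's softened predictions to the teacher's. A minimal numpy sketch of that general idea (Hinton-style KD, not this paper's loss; the temperature `T` and function names are illustrative):

```python
import numpy as np

def softmax(z, T=1.0):
    # temperature-scaled softmax; higher T gives softer distributions
    e = np.exp((z - z.max(axis=-1, keepdims=True)) / T)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=2.0):
    # KL divergence between softened teacher targets and student predictions
    p = softmax(teacher_logits, T)   # teacher soft targets
    q = softmax(student_logits, T)   # student predictions
    return float(np.sum(p * (np.log(p) - np.log(q))) / len(p))
```

The student minimizes this term (usually alongside the normal task loss), pulling its distribution toward the teacher's.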

 

  • Indoor Scene Parsing
    • Important role in many applications, such as robot navigation and Augmented Reality
    • Challenging
      • including distorted object shapes, severe occlusions, viewpoint variations, and scale ambiguities
    • Previous approach
      • use geometric information
        • leverage additional geometric information to obtain structured information (typically use the depth map)
        • require the depth map inputs to be available not only in training but also in testing
        • limited applicability to general situations, in which depth is not available

3D-to-2D Distillation for Indoor Scene Parsing

Goal

  • Enhance 2D features extracted from RGB images with 3D features extracted from large-scale 3D data repositories for indoor scene parsing

Contribution

  • Distill 3D knowledge from a pretrained 3D network to supervise a 2D network
  • Design a two-stage dimension normalization method to calibrate the 2D and 3D features for better integration
  • Design a semantic-aware adversarial training model to extend our framework for training with unpaired 3D data

Detail

  • 3D-to-2D distillation framework
    • Framework
      • Training inputs: 2D image, 3D point cloud
      • Extract features with a 2D CNN and a 3D CNN, respectively
      • Train the 2D CNN to produce a simulated 3D feature $f_{3d\_sim}$ via dimension normalization
      • Concatenate the 3D and 2D features to generate the segmentation map
      • With an unpaired dataset, use adversarial training with a discriminator
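The data flow above can be sketched end-to-end with toy linear layers standing in for the real networks (all shapes, weights, and the 1x1-conv stand-ins are illustrative, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w):
    # pointwise "convolution" via einsum over the channel dimension (N, C, H, W)
    return np.einsum('oc,nchw->nohw', w, x)

img_feat = rng.normal(size=(1, 8, 4, 4))        # f_2d from the 2D CNN backbone
w_sim = rng.normal(size=(16, 8)) * 0.1          # head producing f_3d_sim
w_seg = rng.normal(size=(5, 24)) * 0.1          # seg head over concatenated features

f3d_sim = conv1x1(img_feat, w_sim)              # simulated 3D feature (16 ch)
fused = np.concatenate([img_feat, f3d_sim], 1)  # 2D + simulated 3D, 24 ch
logits = conv1x1(fused, w_seg)                  # per-pixel class logits
seg_map = logits.argmax(axis=1)                 # (1, 4, 4) segmentation map
```

At test time only the 2D branch runs: the simulated 3D feature is produced from the image alone, so no point cloud is needed.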
    • 3D-to-2D distillation
      • Train the network to learn to simulate 3D features from the 2D features of the input image
      • if paired 2D-3D data is available in the training
        • can determine the pixel location associated with each point in the input point cloud
        • L2 regression loss between $f_{3d}$ and $f_{3d\_sim}$ to supervise the generation of $f_{3d\_sim}$ in the 2D network
      • To train the whole framework for the semantic segmentation task
        • use cross entropy loss.
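A minimal sketch of this joint objective, assuming a mean-squared distillation term plus per-pixel cross entropy, with an assumed weighting factor `lam` (the weight and function names are not from the paper):

```python
import numpy as np

def l2_distill_loss(f3d, f3d_sim):
    # L2 regression between real and simulated 3D features at paired locations
    return float(np.mean((f3d - f3d_sim) ** 2))

def cross_entropy(logits, labels):
    # per-pixel cross entropy for the segmentation head
    z = logits - logits.max(axis=-1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return float(-np.mean(log_p[np.arange(len(labels)), labels]))

def total_loss(f3d, f3d_sim, logits, labels, lam=1.0):
    # joint objective: segmentation CE + lam * distillation L2 (lam is assumed)
    return cross_entropy(logits, labels) + lam * l2_distill_loss(f3d, f3d_sim)
```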
    • Dimension Normalization
      • Reduce the numerical distribution gap between the 2D and 3D features, as they are from different data modalities and neural network models
      • $µ$, $σ$ : channel-wise means and variances of the feature map over the N, H, and W dimensions
      • The purpose of DN1 is to transform the input 2D feature to align its distribution with that of the 3D feature for effective feature learning
      • The purpose of DN2 is to calibrate the learned 3D feature $f_{3d\_sim}$ back to $f_{2d}$ for smooth 2D-3D feature concatenation
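Both DN steps can be viewed as re-calibrating one feature's channel-wise statistics toward the other's. A rough sketch of that idea (the paper's DN modules may use learned parameters; this plain whitening-then-recoloring form is an assumption):

```python
import numpy as np

def dimension_normalize(f_src, f_ref, eps=1e-5):
    # Align channel-wise statistics of f_src with those of f_ref.
    # f_src, f_ref: feature maps of shape (N, C, H, W); stats are taken
    # per channel over the N, H, W dimensions.
    mu_s = f_src.mean(axis=(0, 2, 3), keepdims=True)
    sig_s = f_src.std(axis=(0, 2, 3), keepdims=True)
    mu_r = f_ref.mean(axis=(0, 2, 3), keepdims=True)
    sig_r = f_ref.std(axis=(0, 2, 3), keepdims=True)
    # whiten f_src, then recolor with the reference statistics
    return (f_src - mu_s) / (sig_s + eps) * sig_r + mu_r
```

DN1 would call this with the 2D feature as source and 3D statistics as reference; DN2 the reverse, mapping $f_{3d\_sim}$ back toward the 2D feature's distribution.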
    • Adversarial Training with Unpaired Data
      • paired 2D-3D data are typically expensive to acquire in a large quantity
      • design a new adversarial training approach to correlate 2D and 3D features and supervise the generation of the simulated 3D features without pair
      • key insight
        • existing indoor 2D and 3D datasets usually have common object categories
        • propose a novel adversarial training model with a per-category discriminator
          • determines whether the feature is a real 3d feature or is generated from a 2d cnn
      • Goal
        • the 2D network should learn to generate simulated 3D features without requiring paired 2D-3D data, such that the discriminator cannot differentiate real 3D features from simulated ones of the same category
      • Discriminator
        • Each discriminator uses six fully-connected layers
        • Input : $f_{3d}$ or $f_{3d\_sim}$
        • Predict a confidence score that indicates whether the input feature vector comes from the 2D or 3D network.
      • Adversarial Train
        • $Φ_{2d}$ : 2D network that generate $f_{3d\_sim}$
        • $D_c$ : Discriminator for category c
        • similar to the way the generator (like $Φ_{2d}$) and discriminator (like $D_c$) are trained in a conventional GAN model
      • where $N_{c,i}$ (or $M_{c,j}$) equals 1 if the 2D pixel $i$ (or 3D point $j$) belongs to category $c$
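A sketch of the per-category discriminator loss using the masks $N_{c,i}$ and $M_{c,j}$ (the exact formulation is illustrative; `scores_*` stand for outputs of the six-FC-layer discriminator $D_c$, which this sketch does not implement):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def per_category_adv_loss(scores_3d, scores_sim, mask_3d, mask_sim):
    # Discriminator loss for one category c (binary real/fake classification).
    # scores_3d:  D_c outputs on real 3D features f_3d (points j)
    # scores_sim: D_c outputs on simulated features f_3d_sim (pixels i)
    # mask_*: 1 where the point/pixel belongs to category c, else 0
    eps = 1e-8
    p_real = sigmoid(scores_3d)
    p_fake = sigmoid(scores_sim)
    n_real = mask_3d.sum() + eps
    n_fake = mask_sim.sum() + eps
    # D_c learns to score real 3D features high and simulated ones low;
    # the 2D network is trained with the opposite objective on the fake term.
    loss_d = -(mask_3d * np.log(p_real + eps)).sum() / n_real \
             - (mask_sim * np.log(1 - p_fake + eps)).sum() / n_fake
    return float(loss_d)
```

When the discriminator separates real from simulated features well, its loss is low; the 2D network then updates to fool it, pulling $f_{3d\_sim}$ toward the real 3D feature distribution of each category.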

 

Experiments

  • Dataset:
    • Paired 2D & 3D: ScanNet-v2, S3DIS
    • 2D only: NYUv2
  • The segmentation and 3D CNN architectures are interchangeable
  • Comparing
    • Comparing with Related Geometry-Assisted and Knowledge Distillation Methods

  • Baseline
    • the original PSPNet-50 model ( just segmentation )
    • Multitask : predict a depth map (supervised), 2d feature
    • Cascade : segmentation using SOTA depth-prediction results as an extra input
    • Geo-aware : distills features from depth maps for assisting 2D semantic segmentation
    • KD : most recent knowledge distillation approach for semantic segmentation
  • Outperforms all of them
    • the simulated 3D features lead to better results than explicitly predicting a depth map
    • 3D-to-2D distillation approach can lead to a higher performance than simply adopting a general knowledge distillation model
  • Further Evaluations with Sparse Paired Data
    • can easily be incorporated into existing semantic segmentation architectures
  • 3D-to-2D Distillation without Paired Data
    • NYU-v2 as the 2D data, ScanNet-v2 as the 3D data to train
    • Distill unpaired 3D data to enrich features in the 2D network 

 

Analysis

  • Qualitative results show improvements in challenging cases:
    • occlusions (Figure 5: door, bathtub and toilet; Figure 6: chair and table)
    • uncommon viewpoints (Figure 5: table)
    • confusions due to similar color and texture (Figure 6: column and chair)

 

  • Utilize 3D information
    • This approach facilitates the 2D network to better utilize 3D information when constructing the deep features.
  • Cross-domain generalization
    • the embedded 3D information in deep features may help the CNN better utilize robust 3D cues, such as shape, for recognition, and further improve the model's generalization ability
    • trained on ScanNet, tested on NYUv2

 

Conclusion

  • presents a novel 3D-to-2D distillation framework that effectively leverages 3D features learned from 3D point clouds to enhance 2D networks for indoor scene parsing
    • 2D network can infer simulated 3D features without any 3D data input
  • propose a two-stage dimension normalization module for effective feature integration
    • To bridge the statistical distribution gap between the 2D and 3D features
  • Extend framework for training with unpaired 2D-3D data
    • design an adversarial training model with the semantic aware adversarial loss

 

How can we use

  • our warehouse environment is similar to indoor scenes; it has a typical 3D structure
    • Indoor ( Floor, Pillar, Ceiling .. )
    • Structure ( Rack, Package.. )
    • leverage our 3d structure knowledge
  • We have some sources to get 3D data
    • Lidar mapping
    • 3D synthetic warehouse
  • We can use this 3D data in our training, and then run inference in motionkit without 3D data
  • If 3D features help generalization, the model can be adapted to other warehouses easily
  • Keep following!