
2022.02.08 Review

---

https://arxiv.org/pdf/2104.02243v2.pdf

 

Background

  • Semantic Segmentation
    • classifying each pixel in an image from a predefined set of classes
  • Knowledge Distillation
    • process of transferring knowledge from one model to another model with a different structure
      • usually a model-compression method in which a small model is trained to mimic a pre-trained, larger model
      • more recently also studied for adapting knowledge across different domains
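The classic form of knowledge distillation matches the student's softened predictions to the teacher's. A minimal numpy sketch of that general idea (Hinton-style KD, not this paper's loss; the temperature `T` and function names are illustrative):

```python
import numpy as np

def softmax(z, T=1.0):
    # temperature-scaled softmax; higher T gives softer distributions
    e = np.exp((z - z.max(axis=-1, keepdims=True)) / T)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=2.0):
    # KL divergence between softened teacher targets and student predictions
    p = softmax(teacher_logits, T)   # teacher soft targets
    q = softmax(student_logits, T)   # student predictions
    return float(np.sum(p * (np.log(p) - np.log(q))) / len(p))
```

The student minimizes this term (usually alongside the normal task loss), pulling its distribution toward the teacher's.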

 

  • Indoor Scene Parsing
    • Important role in many applications, such as robot navigation and Augmented Reality
    • Challenging
      • including distorted object shapes, severe occlusions, viewpoint variations, and scale ambiguities
    • Previous approach
      • use geometric information
        • leverage additional geometric information to obtain structured information (typically use the depth map)
        • require the depth map inputs to be available not only in training but also in testing
        • limited applicability to general situations, in which depth is not available

3D-to-2D Distillation for Indoor Scene Parsing

Goal

  • Enhance 2D features extracted from RGB images with 3D features extracted from large-scale 3D data repositories for indoor scene parsing

Contribution

  • Distill 3D knowledge from a pretrained 3D network to supervise a 2D network
  • Design a two-stage dimension normalization method to calibrate the 2D and 3D features for better integration
  • Design a semantic-aware adversarial training model to extend our framework for training with unpaired 3D data

Detail

  • 3D-to-2D distillation framework
    • Framework
      • Training inputs: 2D image, 3D point cloud
      • Extract features with a 2D CNN and a 3D CNN, respectively
      • Train the 2D CNN to produce a simulated 3D feature $f_{3d\_sim}$ via dimension normalization
      • Concatenate the 3D and 2D features to generate the segmentation map
      • With an unpaired dataset, use adversarial training with a discriminator
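The data flow above can be sketched end-to-end with toy linear layers standing in for the real networks (all shapes, weights, and the 1x1-conv stand-ins are illustrative, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w):
    # pointwise "convolution" via einsum over the channel dimension (N, C, H, W)
    return np.einsum('oc,nchw->nohw', w, x)

img_feat = rng.normal(size=(1, 8, 4, 4))        # f_2d from the 2D CNN backbone
w_sim = rng.normal(size=(16, 8)) * 0.1          # head producing f_3d_sim
w_seg = rng.normal(size=(5, 24)) * 0.1          # seg head over concatenated features

f3d_sim = conv1x1(img_feat, w_sim)              # simulated 3D feature (16 ch)
fused = np.concatenate([img_feat, f3d_sim], 1)  # 2D + simulated 3D, 24 ch
logits = conv1x1(fused, w_seg)                  # per-pixel class logits
seg_map = logits.argmax(axis=1)                 # (1, 4, 4) segmentation map
```

At test time only the 2D branch runs: the simulated 3D feature is produced from the image alone, so no point cloud is needed.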
    • 3D-to-2D distillation
      • Train the network to learn to simulate 3D features from the 2D features of the input image
      • if paired 2D-3D data is available in the training
        • can determine the pixel location associated with each point in the input point cloud
        • L2 regression loss between $f_{3d}$ and $f_{3d\_sim}$ to supervise the generation of $f_{3d\_sim}$ in the 2D network
      • To train the whole framework for the semantic segmentation task
        • use cross entropy loss.
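A minimal sketch of this joint objective, assuming a mean-squared distillation term plus per-pixel cross entropy, with an assumed weighting factor `lam` (the weight and function names are not from the paper):

```python
import numpy as np

def l2_distill_loss(f3d, f3d_sim):
    # L2 regression between real and simulated 3D features at paired locations
    return float(np.mean((f3d - f3d_sim) ** 2))

def cross_entropy(logits, labels):
    # per-pixel cross entropy for the segmentation head
    z = logits - logits.max(axis=-1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return float(-np.mean(log_p[np.arange(len(labels)), labels]))

def total_loss(f3d, f3d_sim, logits, labels, lam=1.0):
    # joint objective: segmentation CE + lam * distillation L2 (lam is assumed)
    return cross_entropy(logits, labels) + lam * l2_distill_loss(f3d, f3d_sim)
```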
    • Dimension Normalization
      • Reduce the numerical distribution gap between the 2D and 3D features, as they are from different data modalities and neural network models
      • $µ$, $σ$ : channel-wise means and variances of the feature map over the N, H, and W dimensions
      • The purpose of DN1 is to transform the input 2D feature to align its distribution with that of the 3D feature for effective feature learning
      • The purpose of DN2 is to calibrate the learned 3D feature $f_{3d\_sim}$ back to $f_{2d}$ for smooth 2D-3D feature concatenation
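Both DN steps can be viewed as re-calibrating one feature's channel-wise statistics toward the other's. A rough sketch of that idea (the paper's DN modules may use learned parameters; this plain whitening-then-recoloring form is an assumption):

```python
import numpy as np

def dimension_normalize(f_src, f_ref, eps=1e-5):
    # Align channel-wise statistics of f_src with those of f_ref.
    # f_src, f_ref: feature maps of shape (N, C, H, W); stats are taken
    # per channel over the N, H, W dimensions.
    mu_s = f_src.mean(axis=(0, 2, 3), keepdims=True)
    sig_s = f_src.std(axis=(0, 2, 3), keepdims=True)
    mu_r = f_ref.mean(axis=(0, 2, 3), keepdims=True)
    sig_r = f_ref.std(axis=(0, 2, 3), keepdims=True)
    # whiten f_src, then recolor with the reference statistics
    return (f_src - mu_s) / (sig_s + eps) * sig_r + mu_r
```

DN1 would call this with the 2D feature as source and 3D statistics as reference; DN2 the reverse, mapping $f_{3d\_sim}$ back toward the 2D feature's distribution.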
    • Adversarial Training with Unpaired Data
      • paired 2D-3D data are typically expensive to acquire in a large quantity
      • design a new adversarial training approach to correlate 2D and 3D features and supervise the generation of the simulated 3D features without pair
      • key insight
        • existing indoor 2D and 3D datasets usually have common object categories
        • propose a novel adversarial training model with a per-category discriminator
          • determines whether the feature is a real 3d feature or is generated from a 2d cnn
      • Goal
        • the 2D network should learn to generate simulated 3D features without requiring paired 2D-3D data, such that the discriminator cannot differentiate real 3D features from simulated ones of the same category
      • Discriminator
        • Each discriminator uses six fully-connected layers
        • Input : $f_{3d}$ or $f_{3d\_sim}$
        • Predict a confidence score that indicates whether the input feature vector comes from the 2D or 3D network.
      • Adversarial Train
        • $Φ_{2d}$ : 2D network that generate $f_{3d\_sim}$
        • $D_c$ : Discriminator for category c
        • similar to the way the generator (like $Φ_{2d}$) and discriminator (like $D_c$) are trained in a conventional GAN model
      • where $N_{c,i}$ (or $M_{c,j}$) equals 1 if the 2D pixel $i$ (or 3D point $j$) belongs to category $c$
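A sketch of the per-category discriminator loss using the masks $N_{c,i}$ and $M_{c,j}$ (the exact formulation is illustrative; `scores_*` stand for outputs of the six-FC-layer discriminator $D_c$, which this sketch does not implement):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def per_category_adv_loss(scores_3d, scores_sim, mask_3d, mask_sim):
    # Discriminator loss for one category c (binary real/fake classification).
    # scores_3d:  D_c outputs on real 3D features f_3d (points j)
    # scores_sim: D_c outputs on simulated features f_3d_sim (pixels i)
    # mask_*: 1 where the point/pixel belongs to category c, else 0
    eps = 1e-8
    p_real = sigmoid(scores_3d)
    p_fake = sigmoid(scores_sim)
    n_real = mask_3d.sum() + eps
    n_fake = mask_sim.sum() + eps
    # D_c learns to score real 3D features high and simulated ones low;
    # the 2D network is trained with the opposite objective on the fake term.
    loss_d = -(mask_3d * np.log(p_real + eps)).sum() / n_real \
             - (mask_sim * np.log(1 - p_fake + eps)).sum() / n_fake
    return float(loss_d)
```

When the discriminator separates real from simulated features well, its loss is low; the 2D network then updates to fool it, pulling $f_{3d\_sim}$ toward the real 3D feature distribution of each category.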

 

Experiments

  • Dataset:
    • Paired 2D & 3D: ScanNet-v2, S3DIS
    • 2D only: NYUv2
  • The segmentation and 3D CNN architectures are interchangeable
  • Comparing
    • Comparing with Related Geometry-Assisted and Knowledge Distillation Methods

  • Baseline
    • the original PSPNet-50 model ( just segmentation )
    • Multitask : predict a depth map (supervised), 2d feature
    • Cascade : segmentation using SOTA depth-prediction results as an extra input
    • Geo-aware : distills features from depth maps for assisting 2D semantic segmentation
    • KD : most recent knowledge distillation approach for semantic segmentation
  • Outperforms all of them
    • the simulated 3D features lead to better results than explicitly predicting a depth map
    • 3D-to-2D distillation approach can lead to a higher performance than simply adopting a general knowledge distillation model
  • Further Evaluations with Sparse Paired Data
    • can easily be incorporated into existing semantic segmentation architectures
  • 3D-to-2D Distillation without Paired Data
    • NYU-v2 as the 2D data, ScanNet-v2 as the 3D data to train
    • Distill unpaired 3D data to enrich features in the 2D network 

 

Analysis

  • Qualitative results show improvements in challenging cases:
    • occlusions (Figure 5: door, bathtub and toilet; Figure 6: chair and table)
    • uncommon viewpoints (Figure 5: table)
    • confusions due to similar color and texture (Figure 6: column and chair)

 

  • Utilize 3D information
    • This approach facilitates the 2D network to better utilize 3D information when constructing the deep features.
  • Cross-domain generalization
    • the embedded 3D information in deep features may help the CNN better utilize robust 3D cues, such as shape, for recognition, and further improve the model's generalization ability
    • trained on ScanNet, tested on NYUv2

 

Conclusion

  • presents a novel 3D-to-2D distillation framework that effectively leverages 3D features learned from 3D point clouds to enhance 2D networks for indoor scene parsing
    • 2D network can infer simulated 3D features without any 3D data input
  • propose a two-stage dimension normalization module for effective feature integration
    • To bridge the statistical distribution gap between the 2D and 3D features
  • Extend framework for training with unpaired 2D-3D data
    • design an adversarial training model with the semantic aware adversarial loss

 

How can we use

  • our warehouse environment is similar to indoor scenes; it has a typical 3D structure
    • Indoor ( Floor, Pillar, Ceiling .. )
    • Structure ( Rack, Package.. )
    • leverage our 3d structure knowledge
  • We have some sources to get 3D data
    • Lidar mapping
    • 3D synthetic warehouse
  • We can use this 3D data in our training, and then run inference in motionkit without 3D data
  • If 3D features help generalization, the model can be adapted to other warehouses easily
  • Keep following!