3.12

Managed to understand the whole code base of the CLIP repo from OpenAI. Planned to take a look at CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation, to understand how to implement Open-Vocabulary Segmentation (OVS) using CLIP.

3.13

1. DETR

Got a basic understanding of DETR, an awesome end-to-end 2D object detection architecture, whose downsides are:

  • Long training period
  • Difficulty of detecting small objects

but has advantages in:

  • Uses object queries to replace anchor generation
  • Uses the Hungarian algorithm to replace the NMS post-processing stage

which essentially turns components that are not learnable into learnable parameters.
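Below is a minimal sketch of the bipartite-matching idea, using SciPy's linear_sum_assignment (which solves the same assignment problem the Hungarian algorithm does). The cost here is a simplified placeholder (negative class probability plus a weighted L1 box distance), not DETR's exact cost with generalized IoU; the tensor names and the weight are illustrative assumptions.

    # Minimal sketch of DETR-style set matching between queries and GT objects.
    # Assumed inputs: pred_logits (N_q, N_cls), pred_boxes (N_q, 4),
    # gt_labels (N_gt,), gt_boxes (N_gt, 4); the cost terms are simplified.
    import torch
    from scipy.optimize import linear_sum_assignment

    def match(pred_logits, pred_boxes, gt_labels, gt_boxes):
        prob = pred_logits.softmax(-1)                     # (N_q, N_cls)
        cost_cls = -prob[:, gt_labels]                     # higher class prob -> lower cost
        cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)  # L1 distance between boxes
        cost = cost_cls + 5.0 * cost_box                   # illustrative weighting
        q_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
        return q_idx, gt_idx                               # one-to-one matches

Each GT box is assigned to exactly one query; unmatched queries are trained to predict the "no object" class, which is what removes the need for NMS.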

The follow-up works: Deformable DETR, DINO, Omni-DETR, Up-DETR, PNP-DETR, SMAC-DETR, DAB-DETR, SAM-DETR, DN-, OW-, OV-, Pixel2Seq…

2. Image generation survey

  • GAN:
    • Merit:
      • Generated images are sharp and realistic.
    • Downside:
      • Unstable training process (the only source of randomness is the initial input noise).
      • The outputs lack diversity; they stay close to the training images.
  • AE (auto-encoder)
  • DAE (denoising auto-encoder)

DAE adds noise to the input to get a corrupted version $x_c$ and then reconstructs the original. This line of work shows that images have a high degree of redundancy, and it is closely related to MAE.

3. DALL·E 2

  • The diffusion model is essentially a multi-layered (hierarchical) VAE.

3.14 3D object detection

1. Datasets
  • KITTI: the most classic dataset used in 3D object detection.
    | Category | Truncation | Occlusion | Alpha | Bbox_X1 | Bbox_Y1 | Bbox_X2 | Bbox_Y2 | Dimensions_3D_Height | Dimensions_3D_Width | Dimensions_3D_Length | Location_X | Location_Y | Location_Z | Yaw |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | Pedestrian | 0.00 | 0 | -0.20 | 712.40 | 143.00 | 810.73 | 307.92 | 1.89 | 0.48 | 1.20 | 1.84 | 1.47 | 8.41 | 0.01 |


Open3D can be used for visualization and further processing.
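As a quick example, here is a minimal sketch (the file path is hypothetical) that loads one KITTI velodyne scan, which is stored as float32 (x, y, z, reflectance) tuples, and shows it in Open3D's viewer:

    # Minimal sketch: load a KITTI velodyne scan (.bin, float32 x/y/z/reflectance)
    # and visualize the points with Open3D. The path is just an example.
    import numpy as np
    import open3d as o3d

    points = np.fromfile("kitti/training/velodyne/000000.bin", dtype=np.float32).reshape(-1, 4)

    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points[:, :3])  # keep only x, y, z
    o3d.visualization.draw_geometries([pcd])                # opens an interactive viewer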

  • Waymo: a large-scale dataset released by Waymo in 2019.
  • nuScenes: a large-scale autonomous driving dataset (uses tokens to index and query samples).
  • Argoverse2
  • Lyft
2. Model Zoo

From a chronological perspective


Most surveys roughly divide the models into four categories: multi-view, voxel-based, point-based, and point-voxel-based methods. For beginners, though, following the improvements chronologically is easier to digest.

(Figure omitted: an illustration of point-based 3D object detection methods.)

Before 2017

  • VeloFCN: transforms the 3D point cloud into a 2D front view. This is not that good an idea, since many points can be mapped to the same position and depth information is lost.
  • MV3D: combines LiDAR bird's-eye view (BV), LiDAR front view (FV), and RGB image information and fuses them to get an overall feature. To me this method is quite astonishing, since I think it is the pioneering model that introduced multi-modality into 3D object detection. The backend detector was typically R-CNN at that time, so it was quite slow.


During 2017

Two breakthrough works came out: VoxelNet and PointNet++. VoxelNet extracts features from the perspective of 3D voxels, while PointNet++ works directly at the point level.

  • VoxelNet: First, the point cloud is quantized into a uniform 3D grid (as shown in “grouping” in the figure below). Within each grid, a fixed number of points are randomly sampled (with repetitions if there are not enough points). Each point is represented by a 7-dimensional feature, including the X, Y, Z coordinates of the point, its reflectance intensity (R), and the position difference (ΔX, ΔY, ΔZ) relative to the grid’s centroid (the mean position of all points within the grid). Fully connected layers are used to extract features from each point, and then the features of each point are concatenated with the mean features of all points within the grid to form new point features. The advantage of this feature is that it preserves both the characteristics of individual points and the characteristics of a small local area (the grid) surrounding the point. This process of point feature extraction can be repeated multiple times to enhance the descriptive power of the features (as shown in “Stacked Voxel Feature Encoding” in the figure below). Finally, a max pooling operation is performed on all points within the grid to obtain a fixed-length feature vector. All of the above constitutes the feature extraction network. Combined with an RPN, the network can then perform 3D object detection.

(figure omitted)
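A minimal NumPy sketch of the grouping and the 7-dimensional point decoration described above (coordinates, reflectance, and offsets to the voxel centroid); the voxel size and sampling cap are illustrative assumptions, and the stacked VFE layers and RPN are omitted:

    # Minimal sketch of VoxelNet-style grouping and the 7-d point feature
    # (x, y, z, reflectance, dx, dy, dz w.r.t. the voxel centroid). Sizes are illustrative.
    import numpy as np

    def group_and_decorate(points, voxel_size=(0.2, 0.2, 0.4), max_pts=35):
        """points: (N, 4) array of x, y, z, reflectance."""
        coords = np.floor(points[:, :3] / np.array(voxel_size)).astype(np.int32)
        voxels = {}
        for p, c in zip(points, map(tuple, coords)):
            voxels.setdefault(c, []).append(p)

        features = {}
        for c, pts in voxels.items():
            pts = np.stack(pts)
            if len(pts) > max_pts:                           # random sampling to a fixed size
                pts = pts[np.random.choice(len(pts), max_pts, replace=False)]
            centroid = pts[:, :3].mean(axis=0)
            offsets = pts[:, :3] - centroid                  # ΔX, ΔY, ΔZ to the voxel centroid
            features[c] = np.concatenate([pts, offsets], axis=1)  # (n, 7) per-voxel point features
        return features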

  • PointNet++: The primary approach involves using clustering to generate multiple candidate regions (each region being a set of points), and within each candidate region, PointNet is used to extract features of the points. This process is repeated multiple times in a hierarchical manner, where the multiple sets of points output by the clustering algorithm at each iteration are treated as abstracted point clouds for subsequent processing (Set Abstraction, SA). The point features obtained in this manner have a large receptive field and contain rich contextual information from the local neighborhood. Finally, PointNet classification is performed on the sets of points produced by multiple layers of SA to distinguish between objects and the background. Similarly, this method can also be applied to point cloud segmentation.

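A minimal sketch of one Set Abstraction (SA) step as described above: sample centroids, group neighbours with a ball query, then run a shared point-wise network and max-pool per group. Random sampling stands in for farthest point sampling, the MLP is a stub, and all names and sizes are illustrative assumptions:

    # Minimal sketch of one PointNet++ Set Abstraction (SA) step.
    # Random sampling replaces farthest point sampling for brevity; `mlp` is a stub.
    import torch

    def set_abstraction(xyz, feats, n_centroids=128, radius=0.5, n_neighbors=32, mlp=None):
        """xyz: (N, 3) point coordinates, feats: (N, C) per-point features."""
        idx = torch.randperm(xyz.shape[0])[:n_centroids]
        centroids = xyz[idx]                                   # (M, 3) sampled centers

        dist = torch.cdist(centroids, xyz)                     # (M, N) pairwise distances
        grouped = []
        for m in range(n_centroids):
            nbr = torch.nonzero(dist[m] < radius).squeeze(-1)[:n_neighbors]  # ball query
            if nbr.numel() == 0:                               # fall back to the nearest point
                nbr = dist[m].argmin().unsqueeze(0)
            local = torch.cat([xyz[nbr] - centroids[m], feats[nbr]], dim=-1)
            local = mlp(local) if mlp is not None else local   # shared point-wise MLP
            grouped.append(local.max(dim=0).values)            # PointNet-style max pooling
        return centroids, torch.stack(grouped)                 # the abstracted point cloud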

The advantages and downsides of PointNet++ compared with VoxelNet:

  • Merit:
    • There is far less information loss in PointNet++, and there are barely any hyper-parameters to set.
    • Does not need 3D convolution.
  • Downside:
    • Cannot use mature 2D convolutions for feature extraction, which would help ensure both accuracy and efficiency.
    • Too many MLPs lead to low efficiency.

From 2018 to 2020

During this period, lots of follow-up works came out after the invention of VoxelNet and PointNet++.

  • Towards voxel-based:

    • SECOND: uses sparse convolution, raising the speed to 26 FPS and also reducing GPU memory usage.

    • PointPillars: instead of using 3D conv, it stacks the voxels along the height dimension into pillars so that it can exploit mature, hardware-accelerated 2D convolutions, pushing the speed up to 62 FPS (a minimal sketch of the pillar-to-BEV step is given below).

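A minimal sketch of the pillar idea mentioned above: after the per-pillar feature encoder (not shown), pillar features are scattered back into a dense BEV pseudo-image so an ordinary 2D CNN backbone can process them. The grid sizes and names are illustrative assumptions:

    # Minimal sketch of the PointPillars "scatter" step: encoded pillar features are
    # placed into a dense BEV pseudo-image that ordinary 2D convolutions can consume.
    import torch

    def scatter_to_bev(pillar_feats, pillar_coords, n_x=432, n_y=496):
        """pillar_feats: (P, C) encoded pillar features,
           pillar_coords: (P, 2) integer (x_idx, y_idx) of each pillar on the BEV grid."""
        C = pillar_feats.shape[1]
        canvas = torch.zeros(C, n_y, n_x, dtype=pillar_feats.dtype)
        canvas[:, pillar_coords[:, 1], pillar_coords[:, 0]] = pillar_feats.t()
        return canvas.unsqueeze(0)  # (1, C, n_y, n_x) pseudo-image for the 2D CNN backbone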

  • Towards point-based: The SA process in PointNet++ makes the overall process slow, so many follow-up methods came up with the idea of utilizing 2D conv to solve this problem.

    • Point-RCNN: First, PointNet++ is used to extract features from points. These features are then used for foreground segmentation to distinguish between points on objects and background points. At the same time, each foreground point also outputs a 3D candidate bounding box (BBox). The next step involves further feature extraction from points within the candidate BBox, determining the object category to which the BBox belongs, and refining its position and size. Those familiar with 2D object detection might recognize this as a typical two-stage detection model. Indeed, but the difference is that Point-RCNN generates candidates only on foreground points, thereby avoiding the immense computational cost associated with generating dense candidate boxes in 3D space. Nevertheless, as a two-stage detector and considering the substantial computational demand of PointNet++ itself, Point-RCNN still operates at a relatively low efficiency of about 13 FPS. Point-RCNN was later extended to Part-A2, which achieved improvements in both speed and accuracy.

      (figure omitted)

    • 3D-SSD: analyzes the components of previous point-based methods and concludes that the Feature Propagation (FP) layer and the refinement layer are bottlenecks for system speed. The role of the FP layer is to remap the abstracted point features from the Set Abstraction (SA) layer back to the original point cloud, analogous to the Point Cloud Decoder in Point-RCNN as depicted in the figure above. This step is crucial because the abstract points output by SA do not effectively cover all objects, leading to significant information loss. 3D-SSD introduces a new clustering method that considers the similarity between points in both geometric and feature spaces. Through this improved clustering method, the output of the SA layer can be directly used to generate object proposals, avoiding the computational cost associated with the FP layer. Furthermore, to circumvent the region pooling in the refinement phase, 3D-SSD directly uses representative points from the SA output. It utilizes the improved clustering algorithm mentioned earlier to find neighboring points and employs a simple MLP to predict categories and 3D bounding boxes for objects. 3D-SSD can be considered an anchor-free, single-stage detector, aligning with the development trend in the object detection domain. With these improvements, 3D-SSD achieves a processing speed of 25 FPS.


  • Integration of voxel-based and point-based methods:

    On one side, voxels heavily rely on the granularity of quantization parameters: larger grids lead to significant information loss, while smaller grids escalate computational and memory demands. Imagine trying to piece together a puzzle with either too large or too minuscule pieces – neither scenario is ideal for capturing the full picture efficiently. On the other side, points pose their own set of challenges, particularly in extracting contextual features from their neighborhoods and dealing with irregular memory access patterns. In fact, about 80% of the runtime is often consumed by data construction rather than the actual feature extraction process. It’s akin to spending most of your time organizing your tools instead of painting.

    |     | VoxelNet | SECOND | PointPillar | PointRCNN | 3D-SSD |
    | --- | -------- | ------ | ----------- | --------- | ------ |
    | AP  | 64.17%   | 75.96% | 74.31%      | 75.64%    | 79.57% |
    | FPS | 2.0      | 26.3   | 62.0        | 10.0      | 25.0   |

    The fundamental strategy for merging the strengths of voxels and points involves leveraging lower-resolution voxels to capture contextual features (as seen in PV-CNN) or generate object candidates (like in Fast Point RCNN), or even both (examples include PV-RCNN and SA-SSD). Then, these are combined with the original point cloud, preserving both the nuanced features of individual points and the spatial relationships among them. Imagine blending the broad strokes of a paintbrush with the precision of a pencil to create a detailed and context-rich artwork.

    The research in PV-CNN, PV-RCNN, Voxel R-CNN, CenterPoint TO BE CONTINUED

LiDAR-based model

Point-based model

2022

  • SASA: Semantics-Augmented Set Abstraction for Point-based 3D Object Detection (AAAI 22)

2021

  • 3D Object Detection with Pointformer (CVPR 21)

  • Relation Graph Network for 3D Object Detection in Point Clouds (T-IP 21)

  • 3D-CenterNet: 3D object detection network for point clouds with center estimation priority (PR 21)

2020

  • 3DSSD: Point-based 3D Single Stage Object Detector (CVPR 20)

  • Point-GNN: Graph Neural Network for 3D Object Detection in a Point Cloud (CVPR 20)

  • Joint 3D Instance Segmentation and Object Detection for Autonomous Driving (CVPR 20)

  • Improving 3D Object Detection through Progressive Population Based Augmentation (ECCV 20)

  • False Positive Removal for 3D Vehicle Detection with Penetrated Point Classifier (ICIP 20)

2019

  • PointRCNN: 3D Object Proposal Generation and Detection from Point Cloud (CVPR 19)

  • Attentional PointNet for 3D-Object Detection in Point Clouds (CVPRW 19)

  • STD: Sparse-to-Dense 3D Object Detector for Point Cloud (ICCV 19)

  • StarNet: Targeted Computation for Object Detection in Point Clouds (arXiv 19)

  • PointRGCN: Graph Convolution Networks for 3D Vehicles Detection Refinement (arXiv 19)

2018

  • IPOD: Intensive Point-based Object Detector for Point Cloud (arXiv 18)

Grid-based 3D Object Detection (Voxels & Pillars)

2021

  • Object DGCNN: 3D Object Detection using Dynamic Graphs (NeurIPS 21)

  • Center-based 3D Object Detection and Tracking (CVPR 21)

  • Voxel Transformer for 3D Object Detection (ICCV 21)

  • LiDAR-Aug: A General Rendering-based Augmentation Framework for 3D Object Detection (CVPR 21)

  • RAD: Realtime and Accurate 3D Object Detection on Embedded Systems (CVPRW 21)

  • AGO-Net: Association-Guided 3D Point Cloud Object Detection Network (T-PAMI 21)

  • CIA-SSD: Confident IoU-Aware Single-Stage Object Detector From Point Cloud (AAAI 21)

  • Voxel R-CNN: Towards High Performance Voxel-based 3D Object Detection (AAAI 21)

  • Anchor-free 3D Single Stage Detector with Mask-Guided Attention for Point Cloud (ACM MM 21)

  • Integration of Coordinate and Geometric Surface Normal for 3D Point Cloud Object Detection (IJCNN 21)

  • PSANet: Pyramid Splitting and Aggregation Network for 3D Object Detection in Point Cloud (Sensors 21)

2020

  • Every View Counts: Cross-View Consistency in 3D Object Detection with Hybrid-Cylindrical-Spherical Voxelization (NeurIPS 20)

  • HVNet: Hybrid Voxel Network for LiDAR Based 3D Object Detection (CVPR 20)

  • Associate-3Ddet: Perceptual-to-Conceptual Association for 3D Point Cloud Object Detection (CVPR 20)

  • DOPS: Learning to Detect 3D Objects and Predict their 3D Shapes (CVPR 20)

  • Object as Hotspots: An Anchor-Free 3D Object Detection Approach via Firing of Hotspots (ECCV 20)

  • SSN: Shape Signature Networks for Multi-class Object Detection from Point Clouds (ECCV 20)

  • Pillar-based Object Detection for Autonomous Driving (ECCV 20)

  • From Points to Parts: 3D Object Detection From Point Cloud With Part-Aware and Part-Aggregation Network (T-PAMI 20)

  • Reconfigurable Voxels: A New Representation for LiDAR-Based Point Clouds (CoRL 20)

  • SegVoxelNet: Exploring Semantic Context and Depth-aware Features for 3D Vehicle Detection from Point Cloud (ICRA 20)

  • TANet: Robust 3D Object Detection from Point Clouds with Triple Attention (AAAI 20)

  • SARPNET: Shape attention regional proposal network for liDAR-based 3D object detection (NeuroComputing 20)

  • Voxel-FPN: Multi-Scale Voxel Feature Aggregation for 3D Object Detection from LIDAR Point Clouds (Sensors 20)

  • BirdNet+: End-to-End 3D Object Detection in LiDAR Bird’s Eye View (ITSC 20)

  • 1st Place Solution for Waymo Open Dataset Challenge - 3D Detection and Domain Adaptation (arXiv 20)

  • AFDet: Anchor Free One Stage 3D Object Detection (arXiv 20)

2019

  • PointPillars: Fast Encoders for Object Detection from Point Clouds (CVPR 19)

  • End-to-End Multi-View Fusion for 3D Object Detection in LiDAR Point Clouds (CoRL 19)

  • IoU Loss for 2D/3D Object Detection (3DV 19)

  • Accurate and Real-time Object Detection based on Bird’s Eye View on 3D Point Clouds (3DV 19)

  • Focal Loss in 3D Object Detection (RA-L 19)

  • 3D-GIoU: 3D Generalized Intersection over Union for Object Detection in Point Cloud (Sensors 19)

  • FVNet: 3D Front-View Proposal Generation for Real-Time Object Detection from Point Clouds (CISP 19)

  • Class-balanced Grouping and Sampling for Point Cloud 3D Object Detection (arXiv 19)

  • Patch Refinement - Localized 3D Object Detection (arXiv 19)

2018

  • VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection (CVPR 18)

  • PIXOR: Real-time 3D Object Detection from Point Clouds (CVPR 18)

  • SECOND: Sparsely Embedded Convolutional Detection (Sensors 18)

  • RT3D: Real-Time 3-D Vehicle Detection in LiDAR Point Cloud for Autonomous Driving (RA-L 18)

  • BirdNet: a 3D Object Detection Framework from LiDAR Information (ITSC 18)

  • YOLO3D: End-to-end real-time 3D Oriented Object Bounding Box Detection from LiDAR Point Cloud (ECCVW 18)

  • Complex-YOLO: An Euler-Region-Proposal for Real-time 3D Object Detection on Point Clouds (ECCVW 18)

2017 or earlier

  • 3D Fully Convolutional Network for Vehicle Detection in Point Cloud (IROS 17)

  • Vote3Deep: Fast Object Detection in 3D Point Clouds Using Efficient Convolutional Neural Networks (ICRA 17)

  • Vehicle Detection from 3D Lidar Using Fully Convolutional Network (RSS 16)

  • Voting for Voting in Online Point Cloud Object Detection (RSS 15)

3D Object Detection with Mixed Representations (point-voxel based)

2022

  • Behind the Curtain: Learning Occluded Shapes for 3D Object Detection (AAAI 22)

2021

  • LiDAR R-CNN: An Efficient and Universal 3D Object Detector (CVPR 21)

  • PVGNet: A Bottom-Up One-Stage 3D Object Detector with Integrated Multi-Level Features (CVPR 21)

  • HVPR: Hybrid Voxel-Point Representation for Single-stage 3D Object Detection (CVPR 21)

  • Pyramid R-CNN: Towards Better Performance and Adaptability for 3D Object Detection (ICCV 21)

  • Improving 3D Object Detection with Channel-wise Transformer (ICCV 21)

  • SA-Det3D: Self-Attention Based Context-Aware 3D Object Detection (ICCVW 21)

  • From Voxel to Point: IoU-guided 3D Object Detection for Point Cloud with Voxel-to-Point Decoder (ACM MM 21)

  • RV-FuseNet: Range View Based Fusion of Time-Series LiDAR Data for Joint 3D Object Detection and Motion Forecasting (IROS 21)

  • Pattern-Aware Data Augmentation for LiDAR 3D Object Detection (ITSC 21)

  • From Multi-View to Hollow-3D: Hallucinated Hollow-3D R-CNN for 3D Object Detection (T-CSVT 21)

  • Pseudo-Image and Sparse Points: Vehicle Detection With 2D LiDAR Revisited by Deep Learning-Based Methods (T-ITS 21)

  • Dual-Branch CNNs for Vehicle Detection and Tracking on LiDAR Data (T-ITS 21)

  • Improved Point-Voxel Region Convolutional Neural Network: 3D Object Detectors for Autonomous Driving (T-ITS 21)

  • DSP-Net: Dense-to-Sparse Proposal Generation Approach for 3D Object Detection on Point Cloud (IJCNN 21)

  • P2V-RCNN: Point to Voxel Feature Learning for 3D Object Detection From Point Clouds (IEEE Access 21)

  • PV-RCNN++: Point-Voxel Feature Set Abstraction With Local Vector Representation for 3D Object Detection (arXiv 21)

  • M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers (arXiv 21)

2020

  • PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection (CVPR 20)

  • Structure Aware Single-stage 3D Object Detection from Point Cloud (CVPR 20)

  • Searching Efficient 3D Architectures with Sparse Point-Voxel Convolution (ECCV 20)

  • InfoFocus: 3D Object Detection for Autonomous Driving with Dynamic Information Modeling (ECCV 20)

  • SVGA-Net: Sparse Voxel-Graph Attention Network for 3D Object Detection from Point Clouds (arXiv 20)

2019

2018

  • LMNet: Real-time Multiclass Object Detection on CPU Using 3D LiDAR (ACIRS 18)

LiDAR & Camera Fusion for 3D Object Detection (multi-modal)

2022

  • AutoAlign: Pixel-Instance Feature Aggregation for Multi-Modal 3D Object Detection (arXiv 22)

  • Fast-CLOCs: Fast Camera-LiDAR Object Candidates Fusion for 3D Object Detection (WACV 22)

2021

  • Multimodal Virtual Point 3D Detection (NeurIPS 21)

  • PointAugmenting: Cross-Modal Augmentation for 3D Object Detection (CVPR 21)

  • Frustum-PointPillars: A Multi-Stage Approach for 3D Object Detection using RGB Camera and LiDAR (ICCVW 21)

  • Multi-Stage Fusion for Multi-Class 3D Lidar Detection (ICCVW 21)

  • Cross-Modality 3D Object Detection (WACV 21)

  • Sparse-PointNet: See Further in Autonomous Vehicles (RA-L 21)

  • FusionPainting: Multimodal Fusion with Adaptive Attention for 3D Object Detection (ITSC 21)

  • MF-Net: Meta Fusion Network for 3D object detection (IJCNN 21)

  • Multi-Scale Spatial Transformer Network for LiDAR-Camera 3D Object Detection (IJCNN 21)

  • Boost 3-D Object Detection via Point Clouds Segmentation and Fused 3-D GIoU-L1 Loss (T-NNLS)

  • RangeLVDet: Boosting 3D Object Detection in LIDAR with Range Image and RGB Image (Sensors Journal 21)

  • LiDAR Cluster First and Camera Inference Later: A New Perspective Towards Autonomous Driving (arXiv 21)

  • Exploring Data Augmentation for Multi-Modality 3D Object Detection (arXiv 21)

2020

  • PointPainting: Sequential Fusion for 3D Object Detection (CVPR 20)

  • 3D-CVF: Generating Joint Camera and LiDAR Features Using Cross-View Spatial Feature Fusion for 3D Object Detection (ECCV 20)

  • EPNet: Enhancing Point Features with Image Semantics for 3D Object Detection (ECCV 20)

  • PI-RCNN: An Efficient Multi-Sensor 3D Object Detector with Point-Based Attentive Cont-Conv Fusion Module (AAAI 20)

  • CLOCs: Camera-LiDAR Object Candidates Fusion for 3D Object Detection (IROS 20)

  • LRPD: Long Range 3D Pedestrian Detection Leveraging Specific Strengths of LiDAR and RGB (ITSC 20)

  • Fusion of 3D LIDAR and Camera Data for Object Detection in Autonomous Vehicle Applications (Sensors Journal 20)

  • SemanticVoxels: Sequential Fusion for 3D Pedestrian Detection using LiDAR Point Cloud and Semantic Segmentation (MFI 20)

2019

  • Multi-Task Multi-Sensor Fusion for 3D Object Detection (CVPR 19)

  • Complexer-YOLO: Real-Time 3D Object Detection and Tracking on Semantic Point Clouds (CVPRW 19)

  • Sensor Fusion for Joint 3D Object Detection and Semantic Segmentation (CVPRW 19)

  • MVX-Net: Multimodal VoxelNet for 3D Object Detection (ICRA 19)

  • SEG-VoxelNet for 3D Vehicle Detection from RGB and LiDAR Data (ICRA 19)

  • 3D Object Detection Using Scale Invariant and Feature Reweighting Networks (AAAI 19)

  • Frustum ConvNet: Sliding Frustums to Aggregate Local Point-Wise Features for Amodal 3D Object Detection (IROS 19)

  • Deep End-to-end 3D Person Detection from Camera and Lidar (ITSC 19)

  • RoarNet: A Robust 3D Object Detection based on RegiOn Approximation Refinement (IV 19)

  • SCANet: Spatial-channel attention network for 3D object detection (ICASSP 19)

  • One-Stage Multi-Sensor Data Fusion Convolutional Neural Network for 3D Object Detection (Sensors 19)

2018

  • Frustum PointNets for 3D Object Detection from RGB-D Data (CVPR 18)

  • PointFusion: Deep Sensor Fusion for 3D Bounding Box Estimation (CVPR 18)

  • Deep Continuous Fusion for Multi-Sensor 3D Object Detection (ECCV 18)

  • Joint 3D Proposal Generation and Object Detection from View Aggregation (IROS 18)

  • A General Pipeline for 3D Detection of Vehicles (ICRA 18)

  • Fusing Bird’s Eye View LIDAR Point Cloud and Front View Camera Image for 3D Object Detection (IV 18)

  • Robust Camera Lidar Sensor Fusion Via Deep Gated Information Fusion Network (IV 18)

2017 or earlier

  • Multi-View 3D Object Detection Network for Autonomous Driving (CVPR 17)

Tried to dive into the PointPillars codebase later on (below).

3.15 PointPillars & Follow-up related to CLIP

1. PointPillars

  • Set up the environment for PointPillars. The codebase doesn't depend on any 3D object detection framework such as OpenPCDet or MMDetection3D.
  • Figured out the network architecture of PointPillars by reading the code, except for the forward pass (left for tomorrow).
  • The next things to examine are GT box generation, the loss function, training, and evaluation of the model.

2. Open Vocabulary Segmentation paper review

  • Language-driven Semantic Segmentation (LSeg)
    • Advantage:
      • Shows that CLIP can be used to do OVS.
    • Downside:
      • Still a supervised learning method.
      • Accuracy is still not comparable to one-shot methods.
      • Doesn't use text as the supervision signal, meaning the model still relies on manually annotated masks for segmentation.
  • GroupViT: Semantic Segmentation Emerges from Text Supervision (GroupViT)
    • Advantage:
      • Adds Grouping Blocks (which cluster patches into groups) and learnable group tokens (clustering centers).
    • Downside:
      • Still patch-level segmentation; the final output is upsampled with bilinear interpolation.

3. Comparison between different frameworks of 3D object detection

OpenPCDet

  • GitHub: OpenPCDet

  • Community: Active

  • Code Quality: Lightweight and readable. Some top journal papers in recent years were developed based on this framework.

  • Deployment:

    More convenient than other frameworks; existing example deployment implementations are available.

  • Recommendation: A recommended starting point for beginners interested in learning about object detection frameworks.

mmdetection3d

  • GitHub: mmdetection3d
  • Community: Active
  • Documentation: Official documentation available, facilitating easier onboarding.
  • Scope: Compared to OpenPCDet, mmdetection3d encompasses a broader range of scenarios, including 3D detection and segmentation for images, point clouds, and multimodal data sources.
  • Code Quality: Well-packaged, but might be more challenging for beginners. The framework is also the basis for some top journal papers in recent years.
  • Model Deployment: Still in experimental phase.
  • Recommendation: Suitable for those familiar with 3D object detection, offering richer content for further learning.

Det3D

  • GitHub: Det3D
  • Community Feedback: Previously reviewed but not extensively used in recent developments. Similar to OpenPCDet in being lightweight, but has not been updated recently.
  • Deployment Solutions: Some existing deployment solutions are based on this framework, such as CenterPoint’s deployment: CenterPoint Deployment
  • Recommendation: Lower priority for learning compared to OpenPCDet and mmdetection3d.

Paddle3D

  • GitHub: Paddle3D
  • Community Feedback: Newly open-sourced as of August 2022, with unclear performance outcomes at the moment.

After this review, I plan to start with OpenPCDet for its ease of understanding.

3.16 PointPillars

  • Learned how anchors are generated and matched with the GT boxes.
  • Also understood the loss computation, which uses focal loss for the classification (cls) loss and smooth L1 for the regression loss.
  • Figured out the training process, which uses torch.optim.AdamW() as the optimizer and torch.optim.lr_scheduler.OneCycleLR() as the scheduler (a minimal sketch follows this list).
  • What's left: prediction for a single point cloud (involves NMS), the visualization step, and the metrics and evaluation methods used in the 3D object detection area.
  • Ran the KITTI benchmark on the model to check the accuracy (benchmark).
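A minimal sketch of that optimizer/scheduler setup; the model is a stand-in and the hyper-parameters are illustrative, not the repo's actual values:

    # Minimal sketch of the AdamW + OneCycleLR setup described above.
    # The model and hyper-parameters are placeholders, not the repo's real ones.
    import torch

    model = torch.nn.Linear(64, 7)                     # stand-in for the detection network
    optimizer = torch.optim.AdamW(model.parameters(), lr=2.5e-4, weight_decay=0.01)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=2.5e-3, total_steps=1000)    # one LR cycle over all training steps

    for step in range(1000):
        loss = model(torch.randn(8, 64)).sum()         # dummy forward pass / loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()                               # OneCycleLR is stepped per batch, not per epoch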

3.17 Quantization & Swin-transformer

1. Quantization concepts (independent of PyTorch)

  • Linear:

    • Affine (asymmetric): note the importance of the zero-point offset Z (a minimal sketch follows this list).

      (figure omitted)

      • int8 ( -128 ~ 127 )
      • uint8 ( 0 ~ 255 )
      • MinMax (above)
      • Histogram (the two range boundaries shrink towards each other until the covered fraction is at or below the required percentage)
      • Entropy: TO BE CONTINUED
    • Symmetric: simpler than the asymmetric (affine) method.

  • Non-linear
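A minimal sketch of affine (asymmetric) min-max quantization to uint8, showing how the scale and zero-point Z are computed; the epsilon guard is just an assumption to avoid division by zero:

    # Minimal sketch of affine (asymmetric) min-max quantization to uint8.
    # The zero-point Z shifts the grid so that real 0.0 maps to an exact integer.
    import numpy as np

    def affine_quantize(x, qmin=0, qmax=255):
        scale = max((x.max() - x.min()) / (qmax - qmin), 1e-8)
        zero_point = int(round(qmin - x.min() / scale))
        q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
        return q, scale, zero_point

    def affine_dequantize(q, scale, zero_point):
        return scale * (q.astype(np.float32) - zero_point)

    x = np.random.randn(8).astype(np.float32)
    q, s, z = affine_quantize(x)
    print(x)
    print(affine_dequantize(q, s, z))   # approximates the original values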

2. Swin-transformer:

  • Borrows the locality idea from convolution: attention is first restricted to local windows and then gradually expanded to a global perspective (a window-partition sketch is given below).
  • The architecture of the network: (figure omitted)
  • Went through the entire Swin Transformer codebase.
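A minimal sketch of the window-partition step mentioned above: the feature map is chopped into non-overlapping windows and attention is computed inside each window only (shifted windows and patch merging are omitted); shapes follow a (B, H, W, C) convention, which is an assumption for this sketch:

    # Minimal sketch of Swin-style window partition: attention is restricted to
    # non-overlapping local windows instead of the whole feature map.
    import torch

    def window_partition(x, window_size=7):
        """x: (B, H, W, C) feature map; H and W are assumed divisible by window_size."""
        B, H, W, C = x.shape
        x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
        windows = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)
        return windows  # (num_windows * B, window_size^2, C), fed to multi-head attention

    x = torch.randn(2, 56, 56, 96)
    print(window_partition(x).shape)    # torch.Size([128, 49, 96])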

3.18 Quantization about Matrix Multiplication

1. im2col

(figure omitted: a 3-channel input)

(figure omitted: 3 kernels, transformed as below)

Therefore, the convolution operation is successfully transformed into a matrix multiplication (one kernel):

(figure omitted)

The case with three kernels is the same.
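A minimal NumPy sketch of im2col for one kernel (stride 1, no padding, sizes are illustrative): every sliding patch is flattened into a column, so the convolution becomes a single matrix multiplication; stacking the three kernels as rows of the weight matrix covers the multi-kernel case:

    # Minimal sketch of im2col: unroll each sliding window into a column so that
    # convolution turns into one matrix multiplication.
    import numpy as np

    def im2col(x, k):
        """x: (C, H, W) input, k: kernel size (stride 1, no padding)."""
        C, H, W = x.shape
        cols = []
        for i in range(H - k + 1):
            for j in range(W - k + 1):
                cols.append(x[:, i:i + k, j:j + k].reshape(-1))  # flatten one C*k*k patch
        return np.stack(cols, axis=1)                            # (C*k*k, out_H*out_W)

    x = np.random.randn(3, 5, 5)
    w = np.random.randn(3, 3, 3)               # one kernel with 3 input channels
    out = w.reshape(1, -1) @ im2col(x, 3)      # (1, 9): identical to sliding the kernel over x
    print(out.reshape(3, 3))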

2. Quantization in matrix multiplication (conv)

(figures omitted)

As we can see, in order to leverage INT8 GEMM acceleration, we must eliminate the dependence of the scale s on the reduction index k.

(figure omitted)

If s no longer depends on k, the right-hand side is just a plain INT8 matrix multiplication, with the scale factored out on the left-hand side. Removing k also means that the scale s is shared across each row of X and across each column of W:

(figure omitted)

Then per-channel is easy to explain:

(figure omitted)
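A minimal sketch of this point, assuming symmetric quantization: once X has one scale per row and W has one scale per column (i.e. per output channel), the scales factor out of the sum over k and the accumulation itself is pure integer arithmetic:

    # Minimal sketch (symmetric quantization assumed): a per-row scale for X and a
    # per-column (per output channel) scale for W factor out of the sum over k,
    # leaving an INT8 GEMM with INT32 accumulation.
    import numpy as np

    def sym_quant(x, axis):
        scale = np.abs(x).max(axis=axis, keepdims=True) / 127.0
        return np.round(x / scale).astype(np.int8), scale

    X = np.random.randn(4, 16).astype(np.float32)
    W = np.random.randn(16, 8).astype(np.float32)

    Xq, sx = sym_quant(X, axis=1)    # sx: (4, 1), one scale per row of X
    Wq, sw = sym_quant(W, axis=0)    # sw: (1, 8), one scale per column of W

    acc = Xq.astype(np.int32) @ Wq.astype(np.int32)  # integer-only inner products
    Y_approx = sx * sw * acc                         # scales applied outside the sum
    print(np.abs(Y_approx - X @ W).max())            # small quantization error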

3.19 Quantization with code

1. Different methods

  • Dynamic Quantization: The easiest method of quantization PyTorch supports is called dynamic quantization. This involves not just converting the weights to int8 - as happens in all quantization variants - but also converting the activations to int8 on the fly, just before doing the computation (hence “dynamic”).

    import torch.quantization
    # dynamically quantize all nn.Linear layers: weights are stored as int8,
    # activations are quantized on the fly at inference time
    quantized_model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
    

Additional adjustments may be involved, like replacing the original add and concat operations with nn.quantized.FloatFunctional.

  • Post-Training Static Quantization:

  • Static quantization performs the additional step of first feeding batches of data through the network and computing the resulting distributions of the different activations (specifically, this is done by inserting “observer” modules at different points that record these distributions). This information is used to determine how specifically the different activations should be quantized at inference time.

    # set quantization config for server (x86) deployment
    myModel.qconfig = torch.quantization.get_default_qconfig('fbgemm')

    # insert observers
    torch.quantization.prepare(myModel, inplace=True)
    # Calibrate the model and collect statistics by running representative data through it

    # convert to quantized version
    torch.quantization.convert(myModel, inplace=True)
    
  • Quantization Aware Training:

  • Quantization-aware training (QAT) is the third method, and the one that typically results in the highest accuracy of these three. With QAT, all weights and activations are “fake quantized” during both the forward and backward passes of training: that is, float values are rounded to mimic int8 values, but all computations are still done with floating point numbers. Thus, all the weight adjustments during training are made while “aware” of the fact that the model will ultimately be quantized; after quantizing, therefore, this method usually yields higher accuracy than the other two methods.

    # specify quantization config for QAT
    qat_model.qconfig=torch.quantization.get_default_qat_qconfig('fbgemm')
    
    # prepare QAT
    torch.quantization.prepare_qat(qat_model, inplace=True)
    
    # convert to quantized version, removing dropout, to check for accuracy on each epoch
    quantized_model = torch.quantization.convert(qat_model.eval(), inplace=False)
    

    During the re-training process, we feed the loss between the mimicked int8 values and the true values to the optimizer, so that this quantization loss gets optimized during re-training.

    • Forward pass: use one of the MinMax, Histogram, or Entropy calibration methods to mimic the quantization.
    • Backward pass: use a smoothed (straight-through) approximation so that we can still get a derivative through the rounding step (a minimal sketch below). TO BE CONTINUED
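A minimal sketch of a fake-quantization op with a straight-through estimator (this is one way to realize the "smooth derivative" idea above, not PyTorch's actual observer/fake-quant modules): the forward pass rounds to the int8 grid, and the backward pass lets the gradient pass through the rounding unchanged:

    # Minimal sketch of fake quantization with a straight-through estimator (STE):
    # forward mimics int8 rounding; backward pretends the rounding has derivative 1.
    import torch

    class FakeQuant(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, scale):
            q = torch.clamp(torch.round(x / scale), -128, 127)  # mimic int8
            return q * scale                                     # back to float ("fake" quantization)

        @staticmethod
        def backward(ctx, grad_output):
            return grad_output, None                             # STE: pass the gradient straight through

    x = torch.randn(4, requires_grad=True)
    scale = x.detach().abs().max() / 127.0
    y = FakeQuant.apply(x, scale)
    y.sum().backward()
    print(x.grad)   # all ones: the rounding step is invisible to the gradient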

2. Different frameworks

pytorch-quantization from NVIDIA vs. torch.ao.quantization from native PyTorch

  • torch.ao.quantization aims primarily at CPU backends, while pytorch-quantization focuses on deployment on NVIDIA platforms.
  • The overall procedure is the same.
  • More people seem to use the first one.
  • No one has given a definitive answer on which to prefer.

Things that I plan to do next week

1. More understanding in quantization

2. Go through more code of 3D object detection models

  • Complete the review of code for PointPillars’ predict and visualization process (link above).
  • BevFormer
  • CenterNet and CenterPoint
  • and more…

3. Empirical experiments on quantization of 3D object detection models

It is highly possible that the performance will not be that good on PointPillars, as suggested by:

(figure omitted, from LiDAR-PTQ: Post-Training Quantization for Point Cloud 3D Object Detection)

If the results are bad, I'll read the paper to understand why.

Apply all three types of methods to these models, check the performance, and figure out the reasons.

By next week

  • Get CenterPoint running in evaluation mode.
  • Quantize CenterPoint using the three methods and compare the performance against the unquantized model:
    • Latency for each inference.
    • Use the CUDA memory API to measure total memory usage of the model (see the sketch at the end).
    • Either calculate the activation memory usage by hand or use other tools.
  • After all this, try applying SmoothQuant to the model and see the result.
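A minimal sketch of how the memory part could be measured with torch.cuda's peak-memory API; the model here is only a stand-in, not CenterPoint:

    # Minimal sketch: measure peak GPU memory for one inference pass using
    # torch.cuda's memory statistics. The model is only a placeholder.
    import torch

    model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU()).cuda().eval()
    x = torch.randn(64, 1024, device="cuda")

    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        model(x)
    torch.cuda.synchronize()

    print("peak allocated:", torch.cuda.max_memory_allocated() / 2**20, "MiB")
    print("peak reserved: ", torch.cuda.max_memory_reserved() / 2**20, "MiB")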