Meeting Discussion (11)

1. Table of Contents The feasibility of quantization and SQ on SparseConv3d: ✅ Discussion of sparse conv3d: Find where the computation for sparse conv3d is done: ✅ Inside ops.implicit_gemm: ✅ How can we implement it: Quantization of the activations and weights: ✅ Application of SQ during the quantization process: ✅ Evaluation of the current method: Whether SQ is effective depends on the values of the inputs and weights: ✅ Actual implementation: In progress… 2....

June 16, 2024 · 4 min · Banghao Chi

LLMarking

This is the official repo for the Automatic Short Answer Grading (ASAG) project, named LLMarking, from Xi’an Jiaotong-Liverpool University (XJTLU). Using vLLM as the Large Language Model (LLM) inference framework and FastAPI as the HTTP service framework, this project can achieve high throughput in both LLM token generation and request handling. Feature: This project aims to build a high-concurrency automatic short answer grading (ASAG) system and implement the construction of the service....

June 12, 2024 · 3 min · Banghao Chi

Meeting Discussion (10)

1. Table of Contents: Final results and comparison: ✅ Rerun L1 loss tests within the model to see which part (SmoothQuant or the transformation) brings greater benefit: ✅ Run multiple tests at 50× input scale while varying the SmoothQuant scaling factor to see whether other factors give better results (lower L1 loss): ✅ Extensive experiments to analyze the accuracy loss: ✅ Modify the model based on the above experiments: ✅ Validate the accuracy of the final model through mAP and NDS: ✅ 2....

May 21, 2024 · 7 min · Banghao Chi

Meeting Discussion (9)

1. Table of Contents Implementation of the im2col+gemm operation: ✅ Add an INT8 quantizer to the first operation: ✅ Add SmoothQuant to the second operation: ✅ Verify the operation with different example inputs: ✅ Integrate the im2col+gemm SmoothQuant INT8-quantized layer into the model: ✅ Validate the accuracy of the layer: through different example inputs: ✅ through the actual data flow of the model: ✅ through accuracy: In progress… (because Delta only came back online very late) 2....
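The im2col+gemm decomposition of a convolution can be sketched in NumPy as follows. This is a minimal illustration, not the implementation from the post: it assumes NCHW input, OIHW weights, a single group, no dilation and no bias, and the helper names (`im2col`, `conv2d_im2col`, `conv2d_direct`) are hypothetical. Each output pixel's receptive field is unfolded into one row, a single matrix multiply against the flattened weight produces all outputs, and the result is checked against a loop-based reference convolution.

```python
import numpy as np

def im2col(x, kh, kw, stride=1, padding=0):
    """Unfold an NCHW input into a [N*H_out*W_out, C*kh*kw] matrix."""
    n, c, h, w = x.shape
    x = np.pad(x, ((0, 0), (0, 0), (padding, padding), (padding, padding)))
    h_out = (h + 2 * padding - kh) // stride + 1
    w_out = (w + 2 * padding - kw) // stride + 1
    cols = np.empty((n, h_out, w_out, c, kh, kw), dtype=x.dtype)
    for i in range(h_out):
        for j in range(w_out):
            hs, ws = i * stride, j * stride
            # Each output pixel gets one row holding its full receptive field.
            cols[:, i, j] = x[:, :, hs:hs + kh, ws:ws + kw]
    return cols.reshape(n * h_out * w_out, c * kh * kw), (n, h_out, w_out)

def conv2d_im2col(x, weight, stride=1, padding=0):
    """Conv2d as im2col followed by one GEMM (OIHW weight, no bias)."""
    c_out, c_in, kh, kw = weight.shape
    cols, (n, h_out, w_out) = im2col(x, kh, kw, stride, padding)
    # GEMM: [N*H_out*W_out, C_in*kh*kw] @ [C_in*kh*kw, C_out]
    out = cols @ weight.reshape(c_out, c_in * kh * kw).T
    return out.reshape(n, h_out, w_out, c_out).transpose(0, 3, 1, 2)

def conv2d_direct(x, weight, stride=1, padding=0):
    """Loop-based reference convolution, used only for checking."""
    c_out, _, kh, kw = weight.shape
    x = np.pad(x, ((0, 0), (0, 0), (padding, padding), (padding, padding)))
    n, _, h, w = x.shape
    h_out = (h - kh) // stride + 1
    w_out = (w - kw) // stride + 1
    out = np.zeros((n, c_out, h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            patch = x[:, :, i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[:, :, i, j] = np.einsum("nchw,ochw->no", patch, weight)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 8, 8))
weight = rng.standard_normal((4, 3, 3, 3))
y = conv2d_im2col(x, weight, stride=1, padding=1)
print(y.shape)  # (2, 4, 8, 8)
print(np.allclose(y, conv2d_direct(x, weight, stride=1, padding=1)))  # True
```

In this layout the columns of the unfolded matrix are ordered by (input channel, kernel row, kernel column), so a per-input-channel quantization or smoothing scale applies to a contiguous block of kh*kw columns of the GEMM's left operand.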

May 17, 2024 · 4 min · Banghao Chi

Meeting Discussion (8)

1. Table of Contents Original implementation of SmoothQuant and why it’s not correct: ✅ The correct way to implement it (IMO): ✅ 2. Original way of getting absMax values def register_collect_smoothquant_hook(model, data_loader, num_batch=200): model.eval() act_scales = {} weight_scales = {} def forward_hook(module, input, name): hidden_dim_act = input[0].shape[1] tensor_act = input[0].view(-1, hidden_dim_act).abs().detach() comming_max_act = torch.max(tensor_act, dim=0)[0].float().cpu() if name not in act_scales: act_scales[name] = comming_max_act else: act_scales[name] = torch.max(act_scales[name], comming_max_act) Input shape: [4, 256, 182, 182] hidden_dim_act = 256 tensor_act: [4*182*182, 256] torch....
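The excerpt is truncated, so the post's full argument is not visible here. One issue does follow from the shapes it lists: `view(-1, hidden_dim_act)` reshapes in logical row-major (N, C, H, W) order, so for an input of shape [4, 256, 182, 182] each row of the result is a run of consecutive spatial values from a single channel, not the 256 channel values of one pixel, and the column-wise max is therefore not a per-channel max. The NumPy sketch below contrasts that reshape with moving the channel axis last before flattening. The helper names are hypothetical, and the toy input is chosen only to make the mixing visible.

```python
import numpy as np

def naive_absmax(x):
    """Mirrors x.view(-1, C).abs().max(dim=0) on an NCHW tensor (row-major reshape)."""
    c = x.shape[1]
    return np.abs(x).reshape(-1, c).max(axis=0)

def per_channel_absmax(x):
    """Per-channel abs-max of an NCHW tensor: move C to the last axis, then flatten."""
    c = x.shape[1]
    return np.abs(np.moveaxis(x, 1, -1)).reshape(-1, c).max(axis=0)

def update_act_scales(act_scales, name, x):
    """Running per-channel max across calibration batches (hypothetical helper)."""
    cur = per_channel_absmax(x)
    act_scales[name] = cur if name not in act_scales else np.maximum(act_scales[name], cur)
    return act_scales

# Toy activation: N=1, C=2, H=W=2; channel 0 is all 1s, channel 1 is all 10s.
x = np.stack([np.full((2, 2), 1.0), np.full((2, 2), 10.0)])[None]
print(naive_absmax(x))        # [10. 10.]  channel 0 picks up channel 1's values
print(per_channel_absmax(x))  # [ 1. 10.]
```

In PyTorch the per-channel version could be written as `x.permute(0, 2, 3, 1).reshape(-1, C)` followed by a max over `dim=0`, or directly as `x.abs().amax(dim=(0, 2, 3))`.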

May 14, 2024 · 1 min · Banghao Chi

Meeting Discussion (7)

1. Table of Contents Implementation of SmoothQuant on Conv2d: ✅ Validation of the above implementation: ✅ (for $\alpha = 0.5$) 2. Implementation of the SmoothQuant operation on Conv2d Get the activation scale Get the weight scale Compute the smoothing factor $s$ from the two scales above Apply scaling: $\text{input} \mathrel{{/}{=}} s$ $\text{weight} \mathrel{{*}{=}} s$ 2.1 Get activation & weight scale Take a look at the shapes of the activation, output, and weight in Conv2d: Take one layer as an example: Input shape: torch....
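The four SmoothQuant steps for Conv2d can be sketched in NumPy as follows, assuming the standard SmoothQuant factor $s_j = \max|X_j|^{\alpha} / \max|W_j|^{1-\alpha}$ with $j$ indexing the Conv2d input channels. For a weight of shape [C_out, C_in, kh, kw], the weight max is taken over every axis except C_in. The helper names, the outlier channel, and the small `eps` guard are assumptions for illustration, not the post's code. The check at the end confirms two properties: dividing the input and multiplying the weight by the same per-channel $s$ leaves the convolution output unchanged, and with $\alpha = 0.5$ the per-channel activation and weight maxima become equal.

```python
import numpy as np

def smoothing_factor(act_absmax, weight, alpha=0.5, eps=1e-5):
    """s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), j = Conv2d input channel."""
    # weight: [C_out, C_in, kh, kw] -> abs-max over every axis except C_in
    w_absmax = np.abs(weight).max(axis=(0, 2, 3))
    # eps (an assumption here) guards against all-zero channels
    return np.maximum(act_absmax, eps) ** alpha / np.maximum(w_absmax, eps) ** (1 - alpha)

def apply_smoothing(x, weight, s):
    """input /= s and weight *= s, both along the input-channel axis."""
    return x / s[None, :, None, None], weight * s[None, :, None, None]

def conv2d_valid(x, weight):
    """Reference stride-1, unpadded convolution (NCHW input, OIHW weight)."""
    c_out, _, kh, kw = weight.shape
    n, _, h, w = x.shape
    out = np.zeros((n, c_out, h - kh + 1, w - kw + 1))
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            patch = x[:, :, i:i + kh, j:j + kw]
            out[:, :, i, j] = np.einsum("nchw,ochw->no", patch, weight)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4, 6, 6))
x[:, 1] *= 50.0  # hypothetical outlier channel
weight = rng.standard_normal((3, 4, 3, 3))

act_absmax = np.abs(x).max(axis=(0, 2, 3))  # per-input-channel activation max
s = smoothing_factor(act_absmax, weight, alpha=0.5)
x_s, w_s = apply_smoothing(x, weight, s)

# The convolution output is unchanged up to float rounding ...
print(np.abs(conv2d_valid(x, weight) - conv2d_valid(x_s, w_s)).max())
# ... and with alpha = 0.5 both per-channel maxima equal sqrt(max|X_j| * max|W_j|).
print(np.abs(x_s).max(axis=(0, 2, 3)), np.abs(w_s).max(axis=(0, 2, 3)))
```

Zero padding stays zero under per-channel division, and a bias is added after the sum over input channels, so the same equivalence should also hold for padded Conv2d layers with bias.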

May 10, 2024 · 3 min · Banghao Chi

Meeting Discussion (6)

1. Table of Contents In-depth Memory Usage Visualization: ✅ Ideas about how to implement quantization of sparse conv3d: ✅ Ideas about how to implement the SmoothQuant operation on conv2d: ✅ 2. Large-Chunk GPU Memory Usage Overview Data loader Backbone 3d Backbone 3d -> Backbone 2d Backbone 2d Head Below is the structure of each major chunk: Data Loader Backbone 3d 3d feature to 2d feature Backbone 2d Head 3. How to implement quantization of sparse conv3d?...

May 2, 2024 · 1 min · Banghao Chi

Meeting Discussion (5)

1. Table of Contents Accuracy graph under different quantization metrics: ✅ Max value within the layers: ✅ 2. Accuracy graph under different quantization metrics: As both graphs show, activations are clearly affected more by quantization. 3. Max value within the layers In the first graph, we can see that the max value within the weights ranges from 0.1 to 2.94, while in the second graph, we can find an interesting max value pattern, with values ranging from 8 to 53....
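One way to see why the larger maxima matter: under symmetric per-tensor INT8 quantization with an abs-max calibrator, the step size is $\Delta = \max|x| / 127$ and the rounding error is bounded by $\Delta/2$. Plugging in the ranges quoted above gives $\Delta \approx 0.023$ for a max of 2.94 versus $\Delta \approx 0.42$ for a max of 53, which is roughly 18× coarser. The NumPy sketch below assumes this simple calibrator (the post's graphs compare several quantization metrics, which may behave differently). It uses evenly spaced synthetic values, not data from the model, and `fake_quant_int8` is a hypothetical helper.

```python
import numpy as np

def fake_quant_int8(x, absmax):
    """Symmetric per-tensor INT8 quantize-dequantize with an abs-max scale."""
    scale = absmax / 127.0
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale, scale

for absmax in (2.94, 53.0):
    # Synthetic values spread evenly over [-absmax, absmax]; not real model data.
    x = np.linspace(-absmax, absmax, 10001)
    x_hat, scale = fake_quant_int8(x, absmax)
    err = np.abs(x - x_hat)
    print(f"absmax={absmax:5.2f}  step={scale:.4f}  max_err={err.max():.4f}  mean_err={err.mean():.4f}")
```

The absolute error grows linearly with the tensor's range, so layers whose activations reach 8 to 53 lose far more precision per value than weights capped near 2.94, unless the calibrator clips outliers or the range is rebalanced (as SmoothQuant does).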

May 1, 2024 · 1 min · Banghao Chi

Meeting Discussion (4)

1. Table of Contents The structure of the CenterPoint-Voxel model: ✅ Inference time of each layer: ✅ Memory usage of each layer: ✅ Storage usage of each layer: ✅ Quantization of the model: ✅ 2. The structure of the CenterPoint-Voxel model CenterPoint( (vfe): MeanVFE() (backbone_3d): VoxelResBackBone8x( (conv_input): SparseSequential( (0): SubMConv3d(5, 16, kernel_size=[3, 3, 3], stride=[1, 1, 1], padding=[1, 1, 1], dilation=[1, 1, 1], output_padding=[0, 0, 0], bias=False, algo=ConvAlgo.MaskImplicitGemm) (1): BatchNorm1d(16, eps=0....

April 26, 2024 · 16 min · Banghao Chi

Meeting Discussion (3)

What has been done? This week’s work: Set up the nuScenes training and validation datasets for the MMDetection3D framework: ✅ Set up the Waymo training and validation datasets for the MMDetection3D framework: ✅ Wrote an API for PTQ under the pytorch-quantization framework (only the model and dataloader definitions are still needed): ✅ Complete walkthrough of CAT-Seg, a SOTA Open Vocabulary Segmentation (OVS) model: ✅ What to discuss? Is this quantization approach appropriate? Any advice on changing the model structure (CAT-Seg)?...

April 2, 2024 · 2 min · Banghao Chi