Bohan Zhai

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2021-35

May 7, 2021

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-35.pdf

3D point-cloud-based perception is a challenging but crucial computer vision task. A point cloud consists of a sparse, unstructured, and unordered set of points. To understand a point cloud, previous point-based methods, such as PointNet++, extract visual features through the hierarchical aggregation of local features. However, such methods have several critical limitations: 1) They require several sampling and grouping operations, which slow down inference. 2) They spend an equal amount of computation on each point in a point cloud, even though many points carry similar semantic meanings. 3) They aggregate local features through down-sampling, which leads to information loss and hurts perception performance. To overcome these challenges, we propose a simple and elegant deep learning model called YOGO (You Only Group Once). YOGO divides a point cloud into a small number of parts and extracts a high-dimensional token to represent the points within each sub-region. Next, we use self-attention to capture token-to-token relations and project the token features back to point features. We formulate this series of operations as a relation inference module (RIM). Compared with previous methods, YOGO only needs to sample and group a point cloud once, so it is highly efficient. Instead of operating on points, YOGO operates on a finite and small number of tokens, each of which summarizes the point features in a sub-region. This allows us to avoid computation on redundant points and thus boosts efficiency. Moreover, YOGO preserves point-wise features by projecting token features back to point features even though the computation is performed on tokens. This avoids information loss and can improve point-wise perception performance.
We conduct thorough experiments to demonstrate that YOGO achieves at least a 3.0x speedup over point-based baselines while delivering competitive segmentation performance on the ShapeNetPart and S3DIS datasets.
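The pipeline the abstract describes — group the cloud once into sub-regions, summarize each as a token, run self-attention over tokens, and project token features back to per-point features — can be sketched as follows. This is a minimal illustrative NumPy sketch, not the paper's implementation: the grouping here is a simple nearest-anchor assignment, and all weight matrices are random placeholders standing in for learned parameters.

```python
import numpy as np

def relation_inference_module(points, num_tokens=8, dim=16, seed=0):
    """Hedged sketch of a YOGO-style relation inference module (RIM):
    group once, attend over tokens, project back to points.
    Weights are random placeholders, not trained parameters."""
    rng = np.random.default_rng(seed)
    n, c = points.shape

    # 1) Group the point cloud ONCE into a small number of sub-regions.
    #    Stand-in for the paper's sampling/grouping: assign each point
    #    to the nearest of `num_tokens` randomly chosen anchor points.
    anchors = points[rng.choice(n, num_tokens, replace=False)]
    dist = ((points[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)
    assign = dist.argmin(axis=1)                  # (n,) sub-region index

    # 2) Embed points, then aggregate each sub-region into one token.
    W_embed = rng.standard_normal((c, dim)) / np.sqrt(c)
    feats = points @ W_embed                      # (n, dim) point features
    tokens = np.stack([
        feats[assign == k].mean(0) if (assign == k).any() else np.zeros(dim)
        for k in range(num_tokens)
    ])                                            # (num_tokens, dim)

    # 3) Self-attention over tokens captures token-to-token relations.
    Wq, Wk, Wv = (rng.standard_normal((dim, dim)) / np.sqrt(dim)
                  for _ in range(3))
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    logits = q @ k.T / np.sqrt(dim)
    attn = np.exp(logits - logits.max(1, keepdims=True))
    attn /= attn.sum(1, keepdims=True)
    tokens = attn @ v                             # refined tokens

    # 4) Project token features back to per-point features via
    #    point-to-token attention, preserving point-wise detail.
    logits_p = (feats @ Wq) @ (tokens @ Wk).T / np.sqrt(dim)
    attn_p = np.exp(logits_p - logits_p.max(1, keepdims=True))
    attn_p /= attn_p.sum(1, keepdims=True)
    return attn_p @ (tokens @ Wv)                 # (n, dim) point features
```

Because sampling and grouping happen only in step 1, the attention in steps 3-4 runs over `num_tokens` summaries rather than all `n` points, which is where the claimed efficiency gain comes from.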

Advisors: Kurt Keutzer and Joseph Gonzalez


BibTeX citation:

@mastersthesis{Zhai:EECS-2021-35,
    Author= {Zhai, Bohan},
    Title= {You Only Group Once: Efficient Point-Cloud Processing with Token Representation and Relation Inference Module},
    School= {EECS Department, University of California, Berkeley},
    Year= {2021},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-35.html},
    Number= {UCB/EECS-2021-35},
    Abstract= {3D point-cloud-based perception is a challenging but crucial computer vision task. A point cloud consists of a sparse, unstructured, and unordered set of points. 
To understand a point cloud, previous point-based methods, such as PointNet++, extract visual features through the hierarchical aggregation of local features. However, such methods have several critical limitations: 1) They require several sampling and grouping operations, which slow down inference. 2) They spend an equal amount of computation on each point in a point cloud, even though many points carry similar semantic meanings. 3) They aggregate local features through down-sampling, which leads to information loss and hurts perception performance. To overcome these challenges, we propose a simple and elegant deep learning model called YOGO (You Only Group Once). YOGO divides a point cloud into a small number of parts and extracts a high-dimensional token to represent the points within each sub-region. Next, we use self-attention to capture token-to-token relations and project the token features back to point features. We formulate this series of operations as a relation inference module (RIM). Compared with previous methods, YOGO only needs to sample and group a point cloud once, so it is highly efficient.
Instead of operating on points, YOGO operates on a finite and small number of tokens, each of which summarizes the point features in a sub-region. This allows us to avoid computation on redundant points and thus boosts efficiency.
Moreover, YOGO preserves point-wise features by projecting token features back to point features even though the computation is performed on tokens. This avoids information loss and can improve point-wise perception performance. 
We conduct thorough experiments to demonstrate that YOGO achieves at least a 3.0x speedup over point-based baselines while delivering competitive segmentation performance on the ShapeNetPart and S3DIS datasets.},
}

EndNote citation:

%0 Thesis
%A Zhai, Bohan 
%T You Only Group Once: Efficient Point-Cloud Processing with Token Representation and Relation Inference Module
%I EECS Department, University of California, Berkeley
%D 2021
%8 May 7
%@ UCB/EECS-2021-35
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-35.html
%F Zhai:EECS-2021-35