You Only Group Once: Efficient Point-Cloud Processing with Token Representation and Relation Inference Module
Bohan Zhai
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2021-35
May 7, 2021
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-35.pdf
3D point-cloud-based perception is a challenging but crucial computer vision task. A point cloud consists of a sparse, unstructured, and unordered set of points. To understand a point cloud, previous point-based methods, such as PointNet++, extract visual features through hierarchical aggregation of local features. However, such methods have several critical limitations: 1) they require several sampling and grouping operations, which slow down inference; 2) they spend an equal amount of computation on every point in a point cloud, even though many points carry similar semantic meanings; 3) they aggregate local features through down-sampling, which leads to information loss and hurts perception performance. To overcome these challenges, we propose a simple and elegant deep learning model called YOGO (You Only Group Once). YOGO divides a point cloud into a small number of sub-regions and extracts a high-dimensional token to represent the points within each sub-region. Next, we use self-attention to capture token-to-token relations and project the token features back to point features. We formulate this series of operations as a relation inference module (RIM). Compared with previous methods, YOGO needs to sample and group a point cloud only once, so it is very efficient. Instead of operating on points, YOGO operates on a small number of tokens, each of which summarizes the point features in a sub-region. This allows us to avoid computation on redundant points and thus boosts efficiency. Moreover, although the computation is performed on tokens, YOGO preserves point-wise features by projecting token features back to point features. This avoids information loss and improves point-wise perception performance. We conduct thorough experiments to demonstrate that YOGO achieves at least a 3.0x speedup over point-based baselines while delivering competitive segmentation performance on the ShapeNetPart and S3DIS datasets.
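To make the token pipeline concrete, below is a minimal PyTorch sketch of a relation inference module in the spirit of the abstract. The max-pooled token extraction, the multi-head self-attention over tokens, and the point-wise fusion step are illustrative assumptions, not details taken from the report.

import torch
import torch.nn as nn


class RelationInferenceModule(nn.Module):
    """Minimal sketch of a relation inference module (RIM).

    Hypothetical choices (not taken from the report): tokens are formed by
    max-pooling point features inside each sub-region, token-to-token
    relations are modeled with standard multi-head self-attention, and token
    features are projected back to points through the same point-to-token
    assignment used for grouping.
    """

    def __init__(self, dim=128, num_tokens=32, num_heads=4):
        super().__init__()
        self.num_tokens = num_tokens
        self.point_mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, point_feats, group_idx):
        # point_feats: (B, N, C) per-point features
        # group_idx:   (B, N) integer in [0, num_tokens); the sub-region each
        #              point was assigned to when the cloud was grouped once.
        B, _, C = point_feats.shape
        x = self.point_mlp(point_feats)
        idx = group_idx.unsqueeze(-1).expand(-1, -1, C)

        # Token extraction: max-pool point features within each sub-region.
        tokens = x.new_full((B, self.num_tokens, C), float('-inf'))
        tokens = tokens.scatter_reduce(1, idx, x, reduce='amax')
        tokens = torch.where(torch.isinf(tokens),
                             torch.zeros_like(tokens), tokens)

        # Token-to-token relations via self-attention.
        tokens, _ = self.attn(tokens, tokens, tokens)

        # Project token features back to the points and fuse them with the
        # original point-wise features so per-point detail is not lost.
        back = torch.gather(tokens, 1, idx)
        return self.fuse(torch.cat([point_feats, back], dim=-1))


# Example usage (random data): 2 clouds, 1024 points, 128-dim features,
# grouped once into 32 sub-regions (random assignments for illustration).
feats = torch.randn(2, 1024, 128)
groups = torch.randint(0, 32, (2, 1024))
out = RelationInferenceModule()(feats, groups)  # -> (2, 1024, 128)

In this sketch, grouping happens only once outside the module: group_idx is computed from the raw coordinates a single time and reused by every RIM block, which is what keeps the per-block cost proportional to the number of tokens rather than the number of points.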
Advisors: Kurt Keutzer and Joseph Gonzalez
BibTeX citation:
@mastersthesis{Zhai:EECS-2021-35,
    Author = {Zhai, Bohan},
    Title = {You Only Group Once: Efficient Point-Cloud Processing with Token Representation and Relation Inference Module},
    School = {EECS Department, University of California, Berkeley},
    Year = {2021},
    Month = {May},
    Url = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-35.html},
    Number = {UCB/EECS-2021-35}
}
EndNote citation:
%0 Thesis %A Zhai, Bohan %T You Only Group Once: Efficient Point-Cloud Processing with Token Representation and Relation Inference Module %I EECS Department, University of California, Berkeley %D 2021 %8 May 7 %@ UCB/EECS-2021-35 %U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-35.html %F Zhai:EECS-2021-35