Research Advancements in Efficient 3D Point Cloud Segmentation: A Literature Review with a Focus on RandLA-Net
The three-dimensional point cloud has emerged as a pivotal data structure for representing the physical world, facilitating a more profound understanding and interaction with complex 3D environments. Its significance spans numerous domains, including computer vision, robotics, autonomous vehicles, and augmented reality, where it serves as the foundation for tasks such as object detection, comprehensive scene understanding, and precise navigation.1 In particular, the semantic segmentation of 3D point clouds, which entails assigning a specific label to each individual point within a dataset, offers a richer and more detailed representation of an environment compared to simpler tasks like object detection that only delineate bounding boxes.2 This granular understanding is crucial for applications such as autonomous driving, where detailed environmental perception is necessary for safe and effective navigation.2 The increasing availability of 3D data from various sensing technologies underscores the importance of developing efficient methods to process and analyze these large datasets, extracting meaningful high-level features for practical applications.4 Point cloud semantic segmentation plays a vital role in this context, dividing the original point cloud into semantically distinct subsets, thereby enabling a more nuanced interpretation of the 3D world.4 While traditional approaches to 3D segmentation relied on hand-crafted features, the field has increasingly embraced deep learning techniques due to their superior ability to learn complex and generalizable features directly from the data.5 The direct acquisition of 3D data through technologies like motion capture further highlights the need for efficient processing to enable real-time applications.6 Despite the growing importance of point clouds, their inherent characteristics, such as sparsity, irregularity, and lack of an ordered structure, present significant challenges for processing and analysis, especially when dealing with the vast amounts of data generated by modern sensors.1 Moreover, the density of point clouds can vary considerably depending on the distance from the sensor, posing particular issues in outdoor environments.2 The limited availability of large-scale, accurately labeled datasets for 3D semantic segmentation also complicates the development and evaluation of robust deep learning models.2
The advent of deep learning has revolutionized the field of point cloud processing, with pioneering works like PointNet establishing the feasibility of directly learning from these unstructured datasets.10 PointNet introduced a novel architecture that learns per-point features using shared multilayer perceptrons (MLPs) and aggregates these features using a symmetric function, typically max pooling, to achieve invariance to the order of points in the input.10 This approach marked a significant departure from earlier methods that required converting point clouds into regular formats like voxels or images before processing.13 By directly consuming point clouds, PointNet eliminated the need for manual feature engineering and offered a computationally efficient way to analyze 3D shapes.11 Furthermore, PointNet incorporated input and feature transformation networks (T-Nets) to ensure robustness to rigid transformations such as rotation and translation.11 This foundational work demonstrated the potential of deep learning for various 3D recognition tasks, including object classification, part segmentation, and scene semantic parsing.11 However, a key limitation of PointNet was its independent processing of each point, which prevented it from effectively capturing local spatial relationships and the intricate geometric structures present in point clouds.10 This lack of local context understanding hindered its performance in tasks requiring fine-grained segmentation and generalization to complex scenes, motivating the development of subsequent architectures.
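To make the idea concrete, the following is a minimal NumPy sketch of the PointNet principle rather than the authors' implementation: a "shared MLP" is simply the same weight matrix applied to every point, and max pooling over the point axis yields an order-invariant global descriptor. The layer widths and random weights are illustrative, and the T-Nets and task-specific heads are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_mlp(points, w, b):
    """Apply the same (shared) weights to every point: (N, d_in) -> (N, d_out)."""
    return np.maximum(points @ w + b, 0.0)  # ReLU

# Toy point cloud: N points with x-y-z coordinates.
points = rng.normal(size=(1024, 3))

# Two shared MLP layers lift every point to a 64-d feature independently of the others.
w1, b1 = rng.normal(scale=0.1, size=(3, 32)), np.zeros(32)
w2, b2 = rng.normal(scale=0.1, size=(32, 64)), np.zeros(64)
per_point = shared_mlp(shared_mlp(points, w1, b1), w2, b2)     # shape (1024, 64)

# Max pooling over the point axis is a symmetric function, so the global
# descriptor is invariant to any permutation of the input points.
global_feature = per_point.max(axis=0)                          # shape (64,)
permuted = shared_mlp(shared_mlp(points[::-1], w1, b1), w2, b2).max(axis=0)
assert np.allclose(global_feature, permuted)
```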
To address PointNet's limitations in capturing local structures, PointNet++ was introduced as a hierarchical neural network that applies PointNet recursively on nested partitions of the input point set.10 This hierarchical approach enables the network to learn features at different scales and capture local details more effectively by exploiting the metric space distances between points.16 The PointNet++ architecture employs set abstraction levels, each consisting of a sampling layer, a grouping layer, and a PointNet layer.17 The sampling layer often utilizes Farthest Point Sampling (FPS) to select a representative subset of points, while the grouping layer identifies neighboring points around these sampled centroids using ball queries or k-nearest neighbors. The PointNet layer then processes these local groups to extract features.17 To handle the variability in point densities, PointNet++ incorporates multi-scale grouping (MSG), which extracts and aggregates features at different scales by varying the neighborhood size.17 For semantic segmentation tasks, a decoder with feature propagation modules is typically used to interpolate the learned features back to the original point cloud resolution for per-point classification.19 Experimental results have demonstrated that PointNet++ achieves significantly better performance than PointNet on challenging 3D point cloud benchmarks.10 However, the use of FPS in PointNet++ can be computationally expensive for very large point clouds, and the network still processes points somewhat independently within local groups, not fully considering the intricate relationships between them.7
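The sampling and grouping steps of a set abstraction level can be sketched directly. Below is a brute-force NumPy illustration of farthest point sampling and ball-query grouping, assuming a simple O(N) distance update per selected centroid (roughly O(N·K) overall); the radius and group sizes are arbitrary placeholders rather than values from the paper.

```python
import numpy as np

def farthest_point_sampling(points, k):
    """Greedy FPS: each new centroid is the point farthest from those already chosen."""
    n = points.shape[0]
    chosen = np.zeros(k, dtype=int)          # start from point 0
    dist = np.full(n, np.inf)
    for i in range(1, k):
        # Update each point's distance to its nearest chosen centroid, then take the max.
        d = np.linalg.norm(points - points[chosen[i - 1]], axis=1)
        dist = np.minimum(dist, d)
        chosen[i] = int(dist.argmax())
    return chosen

def ball_query(points, centroids, radius, max_neighbors):
    """Collect, for each centroid, the indices of points within `radius` (PointNet++-style grouping)."""
    groups = []
    for c in centroids:
        d = np.linalg.norm(points - points[c], axis=1)
        groups.append(np.flatnonzero(d < radius)[:max_neighbors])
    return groups

rng = np.random.default_rng(0)
cloud = rng.uniform(size=(4096, 3))
centroids = farthest_point_sampling(cloud, 512)
groups = ball_query(cloud, centroids, radius=0.1, max_neighbors=32)
# Each local group would then be fed through a small PointNet to yield one feature per centroid.
```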
RandLA-Net presents an alternative approach focused on achieving high efficiency for the semantic segmentation of large-scale 3D point clouds by primarily utilizing random point sampling.7 This choice of sampling strategy stands in contrast to more complex methods like FPS, which can become a computational bottleneck for massive datasets.7 To address the potential loss of crucial features due to the simplicity of random sampling, RandLA-Net introduces a novel Local Feature Aggregation (LFA) module.7 The LFA module is designed to progressively increase the receptive field for each point, effectively preserving geometric details despite the aggressive downsampling inherent in random selection.7 The LFA module comprises three key units: Local Spatial Encoding (LocSE), Attentive Pooling, and a Dilated Residual Block.10 LocSE explicitly encodes the relative spatial coordinates of neighboring points, enabling the network to learn local geometric patterns.10 Attentive Pooling employs an attention mechanism to weight and aggregate the features of these neighboring points, focusing on the most informative ones.10 Finally, the Dilated Residual Block stacks multiple LocSE and Attentive Pooling units to efficiently enlarge the receptive field of each point.10 The overall RandLA-Net architecture typically follows an encoder-decoder structure with skip connections 26, and it primarily utilizes shared MLPs for computational efficiency.10 Notably, RandLA-Net is end-to-end trainable and does not require computationally intensive pre- or post-processing steps like voxelization or graph construction.10
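The sketch below illustrates, for a single point, how LocSE assembles the relative position encoding [p_i, p_k, p_i − p_k, ‖p_i − p_k‖] and how attentive pooling replaces hard max pooling with a learned soft weighting. It is a simplified NumPy illustration in which random matrices stand in for the trained shared MLPs; all dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d_feat = 16, 8

p_i       = rng.normal(size=3)                        # centre point
neighbors = p_i + 0.05 * rng.normal(size=(K, 3))      # its K nearest neighbours
features  = rng.normal(size=(K, d_feat))              # their current point features

# --- LocSE: encode relative geometry [p_i ; p_k ; p_i - p_k ; ||p_i - p_k||] ---
rel   = p_i - neighbors
dist  = np.linalg.norm(rel, axis=1, keepdims=True)
geo   = np.concatenate([np.tile(p_i, (K, 1)), neighbors, rel, dist], axis=1)  # (K, 10)
w_geo = rng.normal(scale=0.1, size=(10, d_feat))
r_ik  = np.maximum(geo @ w_geo, 0.0)                  # shared MLP over the K neighbours

# Augmented neighbour features: geometry encoding concatenated with point features.
f_hat = np.concatenate([r_ik, features], axis=1)      # (K, 2*d_feat)

# --- Attentive pooling: per-feature softmax scores over the neighbours, then weighted sum ---
w_att   = rng.normal(scale=0.1, size=(2 * d_feat, 2 * d_feat))
scores  = f_hat @ w_att
scores  = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)  # softmax over K
f_tilde = (scores * f_hat).sum(axis=0)                # aggregated feature for p_i
```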
Efficiency is a central aspect of RandLA-Net's design, allowing it to process very large point clouds rapidly: it can segment up to 1.03 million points in a single pass and runs up to 200 times faster than methods like SPG.10 The random sampling strategy it employs has a computational complexity of O(1) per sampling operation, making it highly scalable to massive datasets compared with the O(N²) complexity of FPS.10 Experimental results confirm that random sampling is significantly faster than FPS and IDIS for large point clouds.10 RandLA-Net also maintains a small memory footprint, with only around 1.24 million network parameters.10 This efficiency stems from the combination of fast random sampling and the lightweight MLP-based local feature aggregation module.10 Even with the emergence of more recent architectures, RandLA-Net continues to demonstrate a strong balance between speed and performance on challenging benchmarks such as the S3DIS 6-fold segmentation task.24
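The complexity gap is easy to see empirically: random downsampling draws indices in effectively constant time per sampled point, while the FPS sketch above must revisit every remaining point for each centroid it selects. A minimal timing illustration (absolute numbers will vary with hardware):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
cloud = rng.uniform(size=(1_000_000, 3))

t0 = time.perf_counter()
subset = cloud[rng.choice(len(cloud), size=len(cloud) // 4, replace=False)]
print(f"random sampling of 25%: {time.perf_counter() - t0:.3f} s")

# Compare with farthest_point_sampling(cloud, len(cloud) // 4) from the earlier sketch,
# whose cost grows quadratically with the number of points and quickly becomes impractical.
```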
RandLA-Net has demonstrated strong performance on several key 3D point cloud semantic segmentation benchmarks. On the Semantic3D dataset, it has outperformed methods like SPG in both mIoU and OA.10 Similarly, on the SemanticKITTI dataset, RandLA-Net has shown superior mIoU compared to other point-based approaches, including PointNet++ and SPG.10 Evaluations on the S3DIS indoor scene segmentation dataset also indicate that RandLA-Net achieves higher OA, mAcc, and mIoU compared to these earlier methods.10 Moreover, studies applying RandLA-Net to urban environments have reported high F1 scores for segmenting point clouds in various cities, often leveraging transfer learning to address data scarcity.31 Subsequent improvements to the architecture, such as RandLA-Net++ and RandLA-Net3+, have further enhanced performance on urban scene datasets.7 RandLA-Net has also been successfully applied to specialized domains like foreign object detection in nuclear reactors with very high accuracy 32 and for semantic segmentation in the creation of high-definition maps for autonomous driving.28 While RandLA-Net generally performs well, it is worth noting that other efficient architectures like KPConv might achieve slightly better accuracy in some cases, although potentially with higher computational costs 18, and attention-based networks like Point Transformer can sometimes outperform RandLA-Net by better capturing global context.27
Kernel Point Convolution (KPConv) represents another efficient architecture for point cloud segmentation, operating directly on point clouds using a set of kernel points to define convolution weights in Euclidean space.10 This approach offers flexibility as the number and locations of these kernel points can be learned, even allowing for deformable convolutions that adapt to local geometry.34 KPConv is well-suited for handling the irregular nature of point clouds and has shown competitive results on various datasets.22 Its architecture allows for building deep networks 34, and in some comparisons, it has achieved top performance in terms of OA and mIoU, although potentially with longer training times than RandLA-Net.18 Recent advancements include lighter versions like KPConvD and attention-enhanced versions like KPConvX, aiming for improved performance and efficiency.35 An extension called IPCONV explores different kernel point generation strategies for enhanced feature learning.39
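For comparison, the core of a kernel point convolution can be sketched as follows: each neighbour contributes to the output through a linear correlation between its offset from the centre point and a set of kernel point positions. This is a simplified, rigid-kernel NumPy illustration with a random kernel layout and random weights; the learned deformations and the normalisation used in the actual KPConv implementation are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_kernel, K, sigma = 8, 16, 15, 16, 0.3

kernel_points = rng.uniform(-0.3, 0.3, size=(n_kernel, 3))       # fixed kernel point layout
weights       = rng.normal(scale=0.1, size=(n_kernel, d_in, d_out))

center    = rng.normal(size=3)
neighbors = center + 0.2 * rng.normal(size=(K, 3))
feats     = rng.normal(size=(K, d_in))

out = np.zeros(d_out)
for x_j, f_j in zip(neighbors, feats):
    offset = x_j - center
    # Linear correlation: a neighbour mostly influences the kernel points it lies closest to.
    h = np.maximum(0.0, 1.0 - np.linalg.norm(offset - kernel_points, axis=1) / sigma)  # (n_kernel,)
    out += np.einsum("k,kio,i->o", h, weights, f_j)               # weighted sum over kernel points
```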
Point Transformer leverages the power of self-attention networks, inspired by their success in NLP and image analysis, for 3D point cloud processing.10 By applying self-attention layers, Point Transformer can capture long-range dependencies and model relationships between points across the entire point cloud.27 The self-attention mechanism is inherently invariant to the order and number of input points, making it suitable for point cloud data.43 Point Transformer has demonstrated strong performance in tasks like semantic scene segmentation and object part segmentation, achieving state-of-the-art results on datasets like S3DIS.40 Recent versions like Point Transformer V3 (PTv3) prioritize efficiency and simplicity, achieving significant improvements in speed and memory usage while expanding the receptive field.41
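A heavily simplified sketch of the vector self-attention used by the original Point Transformer, restricted to one query point and its K neighbours, is shown below; plain linear maps stand in for the learned query/key/value projections and the positional-encoding MLP, so the shapes and the per-channel (vector) attention weights, not the exact operators, are the point of the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 16, 32

x_i   = rng.normal(size=d)                 # query point feature
x_j   = rng.normal(size=(K, d))            # neighbour features
p_rel = rng.normal(size=(K, 3))            # relative positions p_i - p_j

wq, wk, wv = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
w_pos      = rng.normal(scale=0.1, size=(3, d))

delta = p_rel @ w_pos                      # positional encoding, one vector per neighbour
# Vector attention: a full d-dimensional weight per neighbour, not a single scalar.
logits  = (x_i @ wq) - (x_j @ wk) + delta                              # (K, d)
weights = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)   # softmax over neighbours
y_i = (weights * (x_j @ wv + delta)).sum(axis=0)                       # aggregated output feature
```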
When comparing these efficient architectures, RandLA-Net stands out for its speed in processing large-scale point clouds, often significantly faster than SPG and PointNet++.10 While KPConv can achieve competitive accuracy, it might require longer training times compared to RandLA-Net.18 Point Transformer, especially its more recent efficient iterations, offers strong performance by capturing global context but might have varying computational costs depending on the version.27 A common challenge for earlier methods like PointNet and PointNet++ was their difficulty in directly processing massive point clouds due to computational or memory limitations.10 RandLA-Net's reliance on random sampling and its lightweight design address these issues effectively.10 In contrast, SPG involves computationally intensive pre-processing, and voxelization used by some methods can also be demanding.10 RandLA-Net's end-to-end trainability without complex pre- or post-processing is a significant advantage.10 However, in very complex scenarios, RandLA-Net might exhibit some limitations in classification accuracy compared to methods that better capture fine-grained details or global context.27
Recent research in efficient 3D point cloud segmentation is increasingly exploring label-efficient learning to mitigate the high cost of data annotation.9 Graph Neural Networks (GNNs) are also gaining traction due to their ability to handle unstructured data and leverage geometric relationships within point clouds.8 Serialization-based methods, such as certain Point Transformer variants, are emerging as efficient ways to process point clouds by converting them into ordered sequences.41 Techniques like sparse voxelization and superpoint graphs continue to be utilized in some efficient architectures.10 Multi-modal fusion, which combines information from different sensors, is being investigated to enhance segmentation accuracy.2 Adapting efficient architectures like RandLA-Net for specific applications, such as urban scene understanding and industrial inspection, is also an active area of research.28 The development of lightweight architectures with reduced training costs remains a priority 51, as does the modernization of existing methods like KPConv with attention mechanisms.35
In conclusion, the field of efficient 3D point cloud segmentation has made remarkable progress, with various architectures offering different strengths and trade-offs. Foundational works like PointNet and PointNet++ paved the way for direct point cloud processing and hierarchical feature learning. RandLA-Net stands out for its exceptional efficiency, particularly in terms of processing speed and scalability for large-scale datasets, achieved through its strategic use of random sampling and effective local feature aggregation. While RandLA-Net offers a compelling balance of speed and performance, alternative architectures like KPConv and Point Transformer provide different advantages, such as flexibility in convolution operations and the ability to capture global context, respectively. The ongoing research landscape is characterized by a focus on addressing key challenges like data annotation costs and the need for further improvements in accuracy and efficiency through techniques like label-efficient learning, GNNs, serialization, multi-modal fusion, and architectural innovations. The choice of the most suitable technique ultimately depends on the specific requirements and constraints of the target application, highlighting the continued importance of exploring diverse approaches to achieve efficient and accurate 3D point cloud segmentation.
| Model Name | Total Time (seconds) | Network Parameters (millions) | Max Inference Points (millions) |
| ------------------ | -------------------- | ----------------------------- | ------------------------------- |
| PointNet (Vanilla) | 192 | 0.8 | 0.49 |
| PointNet++ (SSG) | 9831 | 0.97 | 0.98 |
| PointCNN | 8142 | 11 | 0.05 |
| SPG | 43584 | 0.25 | - |
| KPConv | 717 | 14.9 | 0.54 |
| RandLA-Net | 185 | 1.24 | 1.03 |
### Table 1: Comparison of Processing Time and Memory Consumption on SemanticKITTI (Sequence 08)
| Model Name | Overall Accuracy (OA) | Mean Intersection over Union (mIoU) |
| ---------- | --------------------- | ----------------------------------- |
| SPG | 94.0% | 73.2% |
| RandLA-Net | 94.8% | 77.4% |
### Table 2: Performance Comparison on Semantic3D
| Model Name | Mean Intersection over Union (mIoU) |
| ---------- | ----------------------------------- |
| PointNet++ | 20.1% |
| SPG | 17.4% |
| RandLA-Net | 53.9% |
### Table 3: Performance Comparison on SemanticKITTI
| Model Name | Overall Accuracy (OA) | Mean Accuracy (mAcc) | Mean Intersection over Union (mIoU) |
| ---------- | --------------------- | -------------------- | ----------------------------------- |
| PointNet++ | 81.0% | 74.1% | 54.5% |
| SPG | 85.5% | 77.5% | 62.1% |
| RandLA-Net | 88.0% | 82.0% | 70.0% |
### Table 4: Performance Comparison on S3DIS
Works cited
Foundational Models for 3D Point Clouds: A Survey and Outlook - arXiv, accessed on April 13, 2025, https://arxiv.org/html/2501.18594v1
Point Cloud Based Scene Segmentation: A Survey - arXiv, accessed on April 13, 2025, https://arxiv.org/html/2503.12595v1
[2503.12595] Point Cloud Based Scene Segmentation: A Survey - arXiv, accessed on April 13, 2025, https://arxiv.org/abs/2503.12595
Deep-Learning-Based Point Cloud Semantic Segmentation: A Survey, accessed on April 13, 2025, https://www.mdpi.com/2079-9292/12/17/3642
Deep Learning Based 3D Segmentation: A Survey - arXiv, accessed on April 13, 2025, https://arxiv.org/html/2103.05423v5
A Survey on Deep Learning Based Segmentation, Detection and Classification for 3D Point Clouds - PMC, accessed on April 13, 2025, https://pmc.ncbi.nlm.nih.gov/articles/PMC10137403/
Multi-Feature Aggregation for Semantic Segmentation of an Urban Scene Point Cloud, accessed on April 13, 2025, https://www.mdpi.com/2072-4292/14/20/5134
Graph Neural Networks in Point Clouds: A Survey - MDPI, accessed on April 13, 2025, https://www.mdpi.com/2072-4292/16/14/2518
A Survey of Label-Efficient Deep Learning for 3D Point Clouds - arXiv, accessed on April 13, 2025, https://arxiv.org/html/2305.19812v2
RandLA-Net: Efficient Semantic Segmentation of Large-Scale Point Clouds - CVF Open Access (CVPR 2020), accessed on April 13, 2025, https://openaccess.thecvf.com/content_CVPR_2020/papers/Hu_RandLA-Net_Efficient_Semantic_Segmentation_of_Large-Scale_Point_Clouds_CVPR_2020_paper.pdf
PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation - CVF Open Access (CVPR 2017), accessed on April 13, 2025, https://openaccess.thecvf.com/content_cvpr_2017/papers/Qi_PointNet_Deep_Learning_CVPR_2017_paper.pdf
RandLA-Net: Efficient Semantic Segmentation of Large-Scale Point Clouds - University of Oxford, accessed on April 13, 2025, https://www.cs.ox.ac.uk/files/11502/RandLA_Net__Efficient_Semantic_Segmentation_of_Large_Scale_Point_Clouds.pdf
A quick summary of 3D point cloud segmentation techniques - Mindkosh AI, accessed on April 13, 2025, https://mindkosh.com/blog/a-summary-of-3d-point-cloud-segmentation-techniques/
Deep Learning for Point Cloud Segmentation: Whats going on with PointNet? - Reddit, accessed on April 13, 2025, https://www.reddit.com/r/LiDAR/comments/hxc7y2/deep_learning_for_point_cloud_segmentation_whats/
Point cloud segmentation with PointNet - Keras, accessed on April 13, 2025, https://keras.io/examples/vision/pointnet_segmentation/
PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space - Stanford University, accessed on April 13, 2025, https://stanford.edu/~rqi/pointnet2/
PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space - NeurIPS Proceedings, accessed on April 13, 2025, https://proceedings.neurips.cc/paper/7095-pointnet-deep-hierarchical-feature-learning-on-point-sets-in-a-metric-space.pdf
Evaluation Point Cloud semantic Segmentation methods - kth .diva, accessed on April 13, 2025, https://kth.diva-portal.org/smash/get/diva2:1942195/FULLTEXT01.pdf
Get Started with PointNet++ - MathWorks, accessed on April 13, 2025, https://www.mathworks.com/help/lidar/ug/get-started-pointnetplus.html
PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space - arXiv, accessed on April 13, 2025, https://arxiv.org/abs/1706.02413
You Only Group Once: Efficient Point-Cloud Processing with Token Representation and Relation Inference Module - UC Berkeley EECS, accessed on April 13, 2025, https://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-35.pdf
ALS Point Cloud Classification using PointNet++ and KPConv with Prior Knowledge, accessed on April 13, 2025, https://isprs-archives.copernicus.org/articles/XLVI-4-W4-2021/91/2021/isprs-archives-XLVI-4-W4-2021-91-2021.pdf
Leveraging PointNet and PointNet++ for Lyft Point Cloud Classification Challenge - arXiv, accessed on April 13, 2025, https://arxiv.org/html/2404.18665v1
Ground Awareness in Deep Learning for Large Outdoor Point Cloud Segmentation - arXiv, accessed on April 13, 2025, https://arxiv.org/html/2501.18246v1
randlanet - MathWorks, accessed on April 13, 2025, https://www.mathworks.com/help/lidar/ref/randlanet.html
Point cloud classification using RandLA-Net | ArcGIS API for Python, accessed on April 13, 2025, https://developers.arcgis.com/python/latest/guide/point-cloud-classification-using-randlanet/
RandLA-Net: Efficient Semantic Segmentation of Large-Scale Point Clouds (CVPR 2020 Oral) - ResearchGate, accessed on April 13, 2025, https://www.researchgate.net/publication/337560140_RandLA-Net_Efficient_Semantic_Segmentation_of_Large-Scale_Point_Clouds_CVPR_2020_Oral
An Improved RandLa-Net Algorithm Incorporated with NDT for Automatic Classification and Extraction of Raw Point Cloud Data - MDPI, accessed on April 13, 2025, https://www.mdpi.com/2079-9292/11/17/2795
[1911.11236] RandLA-Net: Efficient Semantic Segmentation of Large-Scale Point Clouds, accessed on April 13, 2025, https://arxiv.org/abs/1911.11236
(PDF) RandLA-Net: Efficient Semantic Segmentation of Large-Scale Point Clouds, accessed on April 13, 2025, https://www.researchgate.net/publication/343455209_RandLA-Net_Efficient_Semantic_Segmentation_of_Large-Scale_Point_Clouds
[2312.11880] Point Cloud Segmentation Using Transfer Learning with RandLA-Net: A Case Study on Urban Areas - arXiv, accessed on April 13, 2025, https://arxiv.org/abs/2312.11880
A point cloud semantic segmentation method for nuclear power reactors based on RandLA-Net Model - SciOpen, accessed on April 13, 2025, https://www.sciopen.com/article/10.51393/j.jamst.2023010
Data Study Group Final Report: SenSat - The Alan Turing Institute, accessed on April 13, 2025, https://www.turing.ac.uk/sites/default/files/2020-06/the_alan_turing_institute_data_study_group_final_report_-_sensat_0.pdf
KPConv: Flexible and Deformable Convolution for Point Clouds - ICCV 2019 (geometry.stanford.edu), accessed on April 13, 2025, https://geometry.stanford.edu/lgl_2024/papers/tqdmgg-KPconv-iccv19/tqdmgg-KPconv-iccv19.pdf
KPConvX: Modernizing Kernel Point Convolution with Kernel Attention - CVF Open Access, accessed on April 13, 2025, http://openaccess.thecvf.com/content/CVPR2024/papers/Thomas_KPConvX_Modernizing_Kernel_Point_Convolution_with_Kernel_Attention_CVPR_2024_paper.pdf
KPConvX: Modernizing Kernel Point Convolution with Kernel Attention - arXiv, accessed on April 13, 2025, https://arxiv.org/html/2405.13194v1
PointConvFormer: Revenge of the Point-based Convolution - Apple Machine Learning Research, accessed on April 13, 2025, https://machinelearning.apple.com/research/pointconvformer
Multi-view KPConv For Enhanced 3D Point Cloud Semantic Segmentation Using Multi-Modal Fusion With 2D Images - mediaTUM, accessed on April 13, 2025, https://mediatum.ub.tum.de/doc/1691326/ktkst0yuqrlvdgdca7izlkqm7.Du_2022_MV-KPConv.pdf
IPCONV: Convolution with Multiple Different Kernels for Point Cloud Semantic Segmentation, accessed on April 13, 2025, https://www.mdpi.com/2072-4292/15/21/5136
Point Cloud Segmentation | Papers With Code, accessed on April 13, 2025, https://paperswithcode.com/task/point-cloud-segmentation
Point Transformer V3: Simpler, Faster, Stronger - arXiv, accessed on April 13, 2025, https://arxiv.org/html/2312.10035v1
Point Transformer | Papers With Code, accessed on April 13, 2025, https://paperswithcode.com/paper/point-transformer-1
Point Transformer - CVF Open Access (ICCV 2021), accessed on April 13, 2025, https://openaccess.thecvf.com/content/ICCV2021/papers/Zhao_Point_Transformer_ICCV_2021_paper.pdf
PCT: Point Cloud Transformer - Tsinghua Graphics and Geometric Computing Group, accessed on April 13, 2025, https://cg.cs.tsinghua.edu.cn/papers/PCT.pdf
Tutorial for 3D Semantic Segmentation with Superpoint Transformer - 3D Geodata Academy, accessed on April 13, 2025, https://learngeodata.eu/tutorial-for-3d-semantic-segmentation-with-superpoint-transformer/
Point Transformer V2: Grouped Vector Attention and Partition-based Pooling, accessed on April 13, 2025, https://proceedings.neurips.cc/paper_files/paper/2022/hash/d78ece6613953f46501b958b7bb4582f-Abstract-Conference.html
Semantic Segmentation of Point Cloud Sequences using Point Transformer v3 - Digital Commons@Kennesaw State, accessed on April 13, 2025, https://digitalcommons.kennesaw.edu/cgi/viewcontent.cgi?article=1006&context=masterstheses
[2305.19812] A Survey of Label-Efficient Deep Learning for 3D Point Clouds - arXiv, accessed on April 13, 2025, https://arxiv.org/abs/2305.19812
Semantic Point Cloud Segmentation Using Fast Deep Neural Network and DCRF - PMC, accessed on April 13, 2025, https://pmc.ncbi.nlm.nih.gov/articles/PMC8068939/
Advancements in Point Cloud-Based 3D Defect Detection and Classification for Industrial Systems: A Comprehensive Survey - arXiv, accessed on April 13, 2025, https://arxiv.org/html/2402.12923v1
PointeNet: A Lightweight Framework for Effective and Efficient Point Cloud Analysis - arXiv, accessed on April 13, 2025, https://arxiv.org/html/2312.12743v1

.github/knowledge/2.randla.knowledge.tex

\documentclass[10pt,twocolumn,letterpaper]{article}
\usepackage{cvpr}
\usepackage{times}
\usepackage{epsfig}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{wrapfig}
\usepackage{diagbox}
\usepackage{verbatim}
\usepackage{xcolor}
\usepackage{floatrow}
\usepackage{booktabs}
\usepackage{multirow}
\usepackage[colorlinks,linkcolor=blue,bookmarks=false]{hyperref}
\usepackage{enumitem}
\usepackage{color}
\usepackage{placeins}
\usepackage[toc,page]{appendix}
\newcommand{\ste}[1]{\textcolor{red}{#1}}
\newcommand{\bo}[1]{\textcolor{blue}{#1}}
\newcommand{\hai}[1]{\textcolor{magenta}{#1}}
\newcommand{\acm}[1]{\textcolor{green}{#1}}
\newcommand{\nickname}{RandLA-Net}
\definecolor{Gray}{gray}{0.85}
\newcolumntype{a}{>{\columncolor{Gray}}c}
% Include other packages here, before hyperref.
% If you comment hyperref and then uncomment it, you should delete
% egpaper.aux before re-running latex. (Or just hit 'q' on the first latex
% run, let it finish, and you should be clear).
\cvprfinalcopy % *** Uncomment this line for the final submission
\def\cvprPaperID{8374} % *** Enter the CVPR Paper ID here
\def\httilde{\mbox{\tt\raisebox{-.5ex}{\symbol{126}}}}
% Pages are numbered in submission mode, and unnumbered in camera-ready
\ifcvprfinal\pagestyle{empty}\fi
\begin{document}
\title{RandLA-Net: Efficient Semantic Segmentation of Large-Scale Point Clouds}
\author{Qingyong Hu\textsuperscript{1}, Bo Yang\textsuperscript{1\thanks{Corresponding author}*}, Linhai Xie\textsuperscript{1}, Stefano Rosa\textsuperscript{1}, Yulan Guo\textsuperscript{2,3}, \\Zhihua Wang\textsuperscript{1}, Niki Trigoni\textsuperscript{1}, Andrew Markham\textsuperscript{1} \\
\textsuperscript{1}University of Oxford, \textsuperscript{2}Sun Yat-sen University, \textsuperscript{3}National University of Defense Technology\\
{\tt\small firstname.lastname@cs.ox.ac.uk}}
\maketitle
%\thispagestyle{empty}
\begin{abstract}
We study the problem of efficient semantic segmentation for large-scale 3D point clouds. By relying on expensive sampling techniques or computationally heavy pre/post-processing steps, most existing approaches are only able to be trained and operate over small-scale point clouds. In this paper, we introduce \textbf{\nickname{}}, an efficient and lightweight neural architecture to directly infer per-point semantics for large-scale point clouds. The key to our approach is to use random point sampling instead of more complex point selection approaches. Although remarkably computation and memory efficient, random sampling can discard key features by chance. To overcome this, we introduce a novel local feature aggregation module to progressively increase the receptive field for each 3D point, thereby effectively preserving geometric details. Extensive experiments show that our \nickname{} can process 1 million points in a single pass with up to 200$\times$ faster than existing approaches. Moreover, our \nickname{} clearly surpasses state-of-the-art approaches for semantic segmentation on two large-scale benchmarks Semantic3D and SemanticKITTI.
\end{abstract}
\section{Introduction}
\label{sec:Intro}
Efficient semantic segmentation of large-scale 3D point clouds is a fundamental and essential capability for real-time intelligent systems, such as autonomous driving and augmented reality. A key challenge is that the raw point clouds acquired by depth sensors are typically irregularly sampled, unstructured and unordered. Although deep convolutional networks show excellent performance in structured 2D computer vision tasks, they cannot be directly applied to this type of unstructured data.
\begin{figure}[t]
\centering
\includegraphics[width=1.0\textwidth]{figs/Fig1.pdf}
\caption{Semantic segmentation results of PointNet++ \cite{qi2017pointnet++}, SPG \cite{landrieu2018large} and our approach on SemanticKITTI \cite{behley2019semantickitti}. Our \nickname{} takes only 0.04s to directly process a large point cloud with $10^5$ points over 150$\times$130$\times$10 meters in 3D space, which is up to 200$\times$ faster than SPG. Red circles highlight the superior segmentation accuracy of our approach.}
\label{fig:illustration}
\vspace{-0.2cm}
\end{figure}
Recently, the pioneering work PointNet \cite{qi2017pointnet} has emerged as a promising approach for directly processing 3D point clouds. It learns per-point features using shared multilayer perceptrons (MLPs). This is computationally efficient but fails to capture wider context information for each point. To learn richer local structures, many dedicated neural modules have been subsequently and rapidly introduced. These modules can be generally categorized as: 1) neighbouring feature pooling \cite{qi2017pointnet++, so-net, RSNet, pointweb, zhang2019shellnet}, 2) graph message passing \cite{dgcnn, KCNet,local_spectral,GACNet, clusternet, HPEIN, Agglomeration}, 3) kernel-based convolution \cite{su2018splatnet, hua2018pointwise, wu2018pointconv, octree_guided, ACNN, Geo-CNN, thomas2019kpconv, mao2019interpolated}, and 4) attention-based aggregation \cite{xie2018attentional, PCAN, Yang2019ModelingPC, AttentionalPointNet}. Although these approaches achieve impressive results for object recognition and semantic segmentation, almost all of them are limited to extremely small 3D point clouds (e.g., 4k points or 1$\times$1 meter blocks) and cannot be directly extended to larger point clouds (e.g., millions of points and up to 200$\times$200 meters) without preprocessing steps such as block partition. The reasons for this limitation are three-fold. 1) The commonly used point-sampling methods of these networks are either computationally expensive or memory inefficient. For example, the widely employed farthest-point sampling \cite{qi2017pointnet++} takes over 200 seconds to sample 10\% of 1 million points. 2) Most existing local feature learners usually rely on computationally expensive kernelisation or graph construction, thereby being unable to process massive number of points. 3) For a large-scale point cloud, which usually consists of hundreds of objects, the existing local feature learners are either incapable of capturing complex structures, or do so inefficiently, due to their limited size of receptive fields.
A handful of recent works have started to tackle the task of directly processing large-scale point clouds. SPG \cite{landrieu2018large} preprocesses the large point clouds as super graphs before applying neural networks to learn per super-point semantics. Both FCPN \cite{rethage2018fully} and PCT \cite{PCT} combine voxelization and point-level networks to process massive point clouds. Although they achieve decent segmentation accuracy, the preprocessing and voxelization steps are too computationally heavy to be deployed in real-time applications.
In this paper, we aim to design a memory and computationally efficient neural architecture, which is able to directly process large-scale 3D point clouds in a single pass, without requiring any pre/post-processing steps such as voxelization, block partitioning or graph construction. However, this task is extremely challenging as it requires: 1) a memory and computationally efficient sampling approach to progressively downsample large-scale point clouds to fit in the limits of current GPUs, and 2) an effective local feature learner to progressively increase the receptive field size to preserve complex geometric structures. To this end, we first systematically demonstrate that \textbf{random sampling} is a key enabler for deep neural networks to efficiently process large-scale point clouds. However, random sampling can discard key information, especially for objects with sparse points.
To counter the potentially detrimental impact of random sampling, we propose a new and efficient \textbf{local feature aggregation module} to capture complex local structures over progressively smaller point-sets.
Amongst existing sampling methods, farthest point sampling and inverse density sampling are the most frequently used for small-scale point clouds \cite{qi2017pointnet++, wu2018pointconv, li2018pointcnn, pointweb, Groh2018flexconv}. As point sampling is such a fundamental step within these networks, we investigate the relative merits of different approaches in Section \ref{Sub-sampling}, where we see that the commonly used sampling methods limit scaling towards large point clouds, and act as a significant bottleneck to real-time processing. However, we identify random sampling as by far the most suitable component for large-scale point cloud processing as it is fast and scales efficiently. Random sampling is not without cost, because prominent point features may be dropped by chance and it cannot be used directly in existing networks without incurring a performance penalty. To overcome this issue, we design a new local feature aggregation module in Section \ref{LFA}, which is capable of effectively learning complex local structures by progressively increasing the receptive field size in each neural layer. In particular, for each 3D point, we firstly introduce a local spatial encoding (LocSE) unit to explicitly preserve local geometric structures. Secondly, we leverage attentive pooling to automatically keep the useful local features. Thirdly, we stack multiple LocSE units and attentive poolings as a dilated residual block, greatly increasing the effective receptive field for each point. Note that all these neural components are implemented as shared MLPs, and are therefore remarkably memory and computational efficient.
Overall, being built on the principles of simple \textbf{rand}om sampling and an effective \textbf{l}ocal feature \textbf{a}ggregator, our efficient neural architecture, named \textbf{RandLA-Net}, not only is up to 200$\times$ faster than existing approaches on large-scale point clouds, but also surpasses the state-of-the-art semantic segmentation methods on both Semantic3D \cite{Semantic3D} and SemanticKITTI \cite{behley2019semantickitti} benchmarks. Figure \ref{fig:illustration} shows qualitative results of our approach. Our key contributions are:
\begin{itemize}[leftmargin=*]
\item We analyse and compare existing sampling approaches, identifying random sampling as the most suitable component for efficient learning on large-scale point clouds.
\item We propose an effective local feature aggregation module to preserve complex local structures by progressively increasing the receptive field for each point.
\item We demonstrate significant memory and computational gains over baselines, and surpass the state-of-the-art semantic segmentation methods on multiple large-scale benchmarks.
\end{itemize}
\section{Related Work}
To extract features from 3D point clouds, traditional approaches usually rely on hand-crafted features \cite{point_signatures, fast_hist, landrieu2017structured, hackel2016fast}. Recent learning based approaches \cite{guo2019deep, qi2017pointnet, Point_voxel_cnn} mainly include projection-based, voxel-based and point-based schemes which are outlined here.
\textbf{(1) Projection and Voxel Based Networks.}
To leverage the success of 2D CNNs, many works \cite{li2016vehicle_rss, chen2017multi, PIXOR, pointpillars} project/flatten 3D point clouds onto 2D images to address the task of object detection. However, geometric details may be lost during the projection. Alternatively, point clouds can be voxelized into 3D grids and then powerful 3D CNNs are applied in \cite{sparse, pointgrid, 4dMinkpwski, vvnet, Fast_point_rcnn}. Although they achieve leading results on semantic segmentation and object detection, their primary limitation is the heavy computation cost, especially when processing large-scale point clouds.
\textbf{(2) Point Based Networks.}
Inspired by PointNet/PointNet++ \cite{qi2017pointnet, qi2017pointnet++}, many recent works introduced sophisticated neural modules to learn per-point local features. These modules can be generally classified as 1) neighbouring feature pooling \cite{so-net, RSNet, pointweb, zhang2019shellnet}, 2) graph message passing \cite{dgcnn, KCNet,local_spectral,GACNet, clusternet, HPEIN, Agglomeration, Li_2019_ICCV}, 3) kernel-based convolution \cite{su2018splatnet, hua2018pointwise, wu2018pointconv, octree_guided, ACNN, Geo-CNN, thomas2019kpconv, mao2019interpolated}, and 4) attention-based aggregation \cite{xie2018attentional, PCAN, Yang2019ModelingPC, AttentionalPointNet}. Although these networks have shown promising results on small point clouds, most of them cannot directly scale up to large scenarios due to their high computational and memory costs. Compared with them, our proposed \nickname{} is distinguished in three ways: 1) it only relies on random sampling within the network, thereby requiring much less memory and computation; 2) the proposed local feature aggregator can obtain successively larger receptive fields by explicitly considering the local spatial relationship and point features, thus being more effective and robust for learning complex local patterns; 3) the entire network only consists of shared MLPs without relying on any expensive operations such as graph construction and kernelisation, therefore being superbly efficient for large-scale point clouds.
\textbf{(3) Learning for Large-scale Point Clouds}.
SPG \cite{landrieu2018large} preprocesses the large point clouds as superpoint graphs to learn per super-point semantics. The recent FCPN \cite{rethage2018fully} and PCT \cite{PCT} apply both voxel-based and point-based networks to process the massive point clouds. However, both the graph partitioning and voxelisation are computationally expensive. In contrast, our \nickname{} is end-to-end trainable without requiring additional pre/post-processing steps.
\section{\nickname{}}
\subsection{Overview}
As illustrated in Figure \ref{fig:sampling}, given a large-scale point cloud with millions of points spanning up to hundreds of meters, to process it with a deep neural network inevitably requires those points to be progressively and efficiently downsampled in each neural layer, without losing the useful point features. In our \nickname{}, we propose to use the simple and fast approach of random sampling to greatly decrease point density, whilst applying a carefully designed local feature aggregator to retain prominent features. This allows the entire network to achieve an excellent trade-off between efficiency and effectiveness.
\begin{figure}[htb]
\centering
\includegraphics[width=\textwidth]{figs/Fig2.pdf}
\caption{In each layer of \nickname{}, the large-scale point cloud is significantly downsampled, yet is capable of retaining features necessary for accurate segmentation.}
\label{fig:sampling}
\end{figure}
\vspace{-0.2cm}
\subsection{The quest for efficient sampling}
\label{Sub-sampling}
\begin{figure*}[thb]
\centering
\includegraphics[width=\textwidth]{figs/Fig3.pdf}
\caption{The proposed local feature aggregation module. The top panel shows the local spatial encoding block that extracts features, and the attentive pooling mechanism that weights the most important neighbouring features, based on the local context and geometry. The bottom panel shows how two of these components are chained together, to increase the receptive field size, within a residual block.}
\label{fig:network}
\vspace{-0.2cm}
\end{figure*}
Existing point sampling approaches \cite{qi2017pointnet++, li2018pointcnn, Groh2018flexconv, learning2sample, concrete, wu2018pointconv} can be roughly classified into heuristic and learning-based approaches. However, there is still no standard sampling strategy that is suitable for large-scale point clouds. Therefore, we analyse and compare their relative merits and complexity as follows.\\ \vspace{-2mm}
\noindent\textbf{(1) Heuristic Sampling}
\vspace{-2mm}
\begin{itemize}[leftmargin=*]
\item\textit{Farthest Point Sampling (FPS):} In order to sample $K$ points from a large-scale point cloud $\boldsymbol{P}$ with $N$ points, FPS returns a reordering of the metric space $\{p_1 \cdots p_k \cdots p_K\}$, such that each $p_k$ is the farthest point from the first $k-1$ points. FPS is widely used in \cite{qi2017pointnet++, li2018pointcnn, wu2018pointconv} for semantic segmentation of small point sets. Although it has a good coverage of the entire point set, its computational complexity is $\mathcal{O}(N^2)$. For a large-scale point cloud ($N \sim 10^6$), FPS takes up to 200 seconds\footnote{We use the same hardware in Sec \ref{subsec:Implementation}, unless specified otherwise.} to process on a single GPU. This shows that FPS is not suitable for large-scale point clouds.
\item\textit{Inverse Density Importance Sampling (IDIS):} To sample $K$ points from $N$ points, IDIS reorders all $N$ points according to the density of each point, after which the top $K$ points are selected \cite{Groh2018flexconv}. Its computational complexity is approximately $\mathcal{O}(N)$. Empirically, it takes 10 seconds to process $10^6$ points. Compared with FPS, IDIS is more efficient, but also more sensitive to outliers. However, it is still too slow for use in a real-time system.
\item \textit{Random Sampling (RS):} Random sampling uniformly selects $K$ points from the original $N$ points. Its computational complexity is $\mathcal{O}(1)$, which is agnostic to the total number of input points, i.e., it is constant-time and hence inherently scalable. Compared with FPS and IDIS, random sampling has the highest computational efficiency, regardless of the scale of input point clouds. It only takes 0.004s to process $10^6$ points.
\end{itemize}
\noindent\textbf{(2) Learning-based Sampling}
\vspace{-2.5mm}
\begin{itemize}[leftmargin=*]
\item \textit{Generator-based Sampling (GS):} GS \cite{learning2sample} learns to generate a small set of points to approximately represent the original large point set. However, FPS is usually used in order to match the generated subset with the original set at inference stage, incurring additional computation. In our experiments, it takes up to 1200 seconds to sample 10\% of $10^6$ points.
\item \textit{Continuous Relaxation based Sampling (CRS):}
CRS approaches \cite{concrete, Yang2019ModelingPC} use the reparameterization trick to relax the sampling operation to a continuous domain for end-to-end training. In particular, each sampled point is learnt based on a weighted sum over the full point clouds. It results in a large weight matrix when sampling all the new points simultaneously with a one-pass matrix multiplication, leading to an unaffordable memory cost. For example, it is estimated to take more than a 300 GB memory footprint to sample 10\% of $10^6$ points.
\item \textit{Policy Gradient based Sampling (PGS):} PGS formulates the sampling operation as a Markov decision process \cite{show_attend}. It sequentially learns a probability distribution to sample the points. However, the learnt probability has high variance due to the extremely large exploration space when the point cloud is large. For example, to sample 10\% of $10^6$ points, the exploration space is $\mathrm{C}_{10^{6}}^{10^{5}}$ and it is unlikely to learn an effective sampling policy. We empirically find that the network is difficult to converge if PGS is used for large point clouds.
\end{itemize}
Overall, FPS, IDIS and GS are too computationally expensive to be applied for large-scale point clouds. CRS approaches have an excessive memory footprint and PGS is hard to learn. By contrast, random sampling has the following two advantages: 1) it is remarkably computational efficient as it is agnostic to the total number of input points, 2) it does not require extra memory for computation. Therefore, we safely conclude that random sampling is by far the most suitable approach to process large-scale point clouds compared with all existing alternatives. However, random sampling may result in many useful point features being dropped. To overcome it, we propose a powerful local feature aggregation module as presented in the next section.
\subsection{Local Feature Aggregation}
\label{LFA}
As shown in Figure \ref{fig:network}, our local feature aggregation module is applied to each 3D point in parallel and it consists of three neural units: 1) local spatial encoding (LocSE), 2) attentive pooling, and 3) dilated residual block.
\noindent\textbf{(1) Local Spatial Encoding}\\
Given a point cloud $\boldsymbol{P}$ together with per-point features (e.g., raw RGB, or intermediate learnt features), this local spatial encoding unit explicitly embeds the x-y-z coordinates of all neighbouring points, such that the corresponding point features are always aware of their relative spatial locations. This allows the LocSE unit to explicitly observe the local geometric patterns, thus eventually benefiting the entire network to effectively learn complex local structures. In particular, this unit includes the following steps:
\textit{Finding Neighbouring Points.} For the $i^{th}$ point, its neighbouring points are firstly gathered by the simple $K$-nearest neighbours (KNN) algorithm for efficiency. The KNN is based on the point-wise Euclidean distances.
\textit{Relative Point Position Encoding.} For each of the nearest $K$ points $\{p_i^1 \cdots p_i^k \cdots p_i^K\}$ of the center point $p_i$, we explicitly encode the relative point position as follows:
\begin{equation}
\mathbf{r}_{i}^{k} = MLP\Big(p_i \oplus p_i^k \oplus (p_i-p_i^k) \oplus ||p_i-p_i^k||\Big)
\label{Eq1}
\end{equation}
where $p_i$ and $p_i^k$ are the x-y-z positions of points, $\oplus$ is the concatenation operation, and $||\cdot||$ calculates the Euclidean distance between the neighbouring and center points. It seems that $\mathbf{r}_{i}^{k}$ is encoded from redundant point positions. Interestingly, this tends to aid the network to learn local features and obtains good performance in practice.
\textit{Point Feature Augmentation.} For each neighbouring point $p_i^k$, the encoded relative point positions $\mathbf{r}_{i}^{k}$ are concatenated with its corresponding point features $\mathbf{f}_i^k$, obtaining an augmented feature vector $\mathbf{\hat{f}}_i^k$.
Eventually, the output of the LocSE unit is a new set of neighbouring features $\mathbf{\hat{F}}_i = \{\mathbf{\hat{f}}_i^1 \cdots \mathbf{\hat{f}}_i^k \cdots \mathbf{\hat{f}}_i^K \}$, which explicitly encodes the local geometric structures for the center point $p_i$. We notice that the recent work \cite{liu2019relation} also uses point positions to improve semantic segmentation. However, the positions are used to learn point scores in \cite{liu2019relation}, while our LocSE explicitly encodes the relative positions to augment the neighbouring point features.
\noindent\textbf{(2) Attentive Pooling}\\
This neural unit is used to aggregate the set of neighbouring point features $\mathbf{\hat{F}}_i$. Existing works \cite{qi2017pointnet++, li2018pointcnn} typically use max/mean pooling to hard integrate the neighbouring features, resulting in the majority of the information being lost. By contrast, we turn to the powerful attention mechanism to automatically learn important local features. In particular, inspired by \cite{Yang_ijcv2019}, our attentive pooling unit consists of the following steps.
\textit{Computing Attention Scores.} Given the set of local features $\mathbf{\hat{F}}_i = \{\mathbf{\hat{f}}_i^1 \cdots \mathbf{\hat{f}}_i^k \cdots \mathbf{\hat{f}}_i^K \}$, we design a shared function $g()$ to learn a unique attention score for each feature. Basically, the function $g()$ consists of a shared MLP followed by $softmax$. It is formally defined as follows:
\begin{equation}
\mathbf{s}_{i}^{k} = g(\mathbf{\hat{f}}_i^k, \boldsymbol{W})
\label{Eq2}
\end{equation}
where $\boldsymbol{W}$ is the learnable weights of a shared MLP.
\textit{Weighted Summation.} The learnt attention scores can be regarded as a soft mask which automatically selects the important features. Formally, these features are weighted summed as follows:
\begin{equation}
\mathbf{\Tilde{f}}_{i} = \sum_{k=1}^{K}(\mathbf{\hat{f}}_i^k \cdot \mathbf{s}_{i}^{k})
\end{equation}
To summarize, given the input point cloud $\boldsymbol{P}$, for the $i^{th}$ point $p_i$, our LocSE and Attentive Pooling units learn to aggregate the geometric patterns and features of its $K$ nearest points, and finally generate an informative feature vector $\mathbf{\Tilde{f}}_{i}$.
\noindent\textbf{(3) Dilated Residual Block}\\
Since the large point clouds are going to be substantially downsampled, it is desirable to significantly increase the receptive field for each point, such that the geometric details of input point clouds are more likely to be reserved, even if some points are dropped. As shown in Figure \ref{fig:network}, inspired by the successful ResNet \cite{he2016deep} and the effective dilated networks \cite{DPC}, we stack multiple LocSE and Attentive Pooling units with a skip connection as a dilated residual block.
To further illustrate the capability of our dilated residual block, Figure \ref{fig:Residual} shows that the red 3D point observes $K$ neighbouring points after the first LocSE/Attentive Pooling operation, and then is able to receive information from up to $K^2$ neighbouring points i.e. its two-hop neighbourhood after the second. This is a cheap way of dilating the receptive field and expanding the effective neighbourhood through feature propagation. Theoretically, the more units we stack, the more powerful this block as its sphere of reach becomes greater and greater. However, more units would inevitably sacrifice the overall computation efficiency. In addition, the entire network is likely to be over-fitted. In our \nickname{}, we simply stack two sets of LocSE and Attentive Pooling as the standard residual block, achieving a satisfactory balance between efficiency and effectiveness.
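Concretely, if every LocSE/Attentive Pooling unit gathers $K$ neighbours, then stacking $m$ such units allows a point to receive information from up to
\begin{equation*}
K^{m} \ \ \text{points}, \qquad \text{e.g.,} \quad K=16,\ m=2 \ \Rightarrow\ 16^{2}=256,
\end{equation*}
so two stacked units already provide a greatly enlarged receptive field at negligible extra cost.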
\begin{figure}[t]
\centering
\includegraphics[width=\textwidth]{figs/Fig4.pdf}
\caption{Illustration of the dilated residual block which significantly increases the receptive field (dotted circle) of each point, colored points represent the aggregated features. L: Local spatial encoding, A: Attentive pooling.}
\label{fig:Residual}
\end{figure}
Overall, our local feature aggregation module is designed to effectively preserve complex local structures via explicitly considering neighbouring geometries and significantly increasing receptive fields. Moreover, this module only consists of feed-forward MLPs, thus being computationally efficient.
\subsection{Implementation}
\label{subsec:Implementation}
We implement \nickname{} by stacking multiple local feature aggregation modules and random sampling layers. The detailed architecture is presented in the Appendix. We use the Adam optimizer with default parameters. The initial learning rate is set as 0.01 and decreases by 5\% after each epoch. The number of nearest points $K$ is set as 16. To train our \nickname{} in parallel, we sample a fixed number of points ($\sim 10^5$) from each point cloud as the input. During testing, the whole raw point cloud is fed into our network to infer per-point semantics without pre/post-processing such as geometrical or block partition. All experiments are conducted on an NVIDIA RTX2080Ti GPU.
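For reference, the optimisation schedule and neighbourhood size described above correspond to the following illustrative Python/NumPy snippet; the epoch count is only a placeholder, and the snippet is a sketch of the stated hyper-parameters rather than the released implementation.
\begin{verbatim}
import numpy as np

# Learning rate: 0.01, decayed by 5% after each epoch (epoch count is a placeholder).
base_lr, decay, epochs, K = 0.01, 0.95, 100, 16
lr_schedule = [base_lr * decay ** e for e in range(epochs)]

# Brute-force K-nearest-neighbour query with K = 16, as used by LocSE.
rng = np.random.default_rng(0)
pts = rng.uniform(size=(100000, 3))
knn_idx = np.argsort(np.linalg.norm(pts - pts[0], axis=1))[:K]
\end{verbatim}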
\section{Experiments}
\begin{comment}
\end{comment}
\subsection{Efficiency of Random Sampling}
\label{sec:eff_sampling}
In this section, we empirically evaluate the efficiency of existing sampling approaches including FPS, IDIS, RS, GS, CRS, and PGS, which have been discussed in Section \ref{Sub-sampling}. In particular, we conduct the following 4 groups of experiments.
\begin{itemize}[leftmargin=*]
\item Group 1. Given a small-scale point cloud ($\sim 10^3$ points), we use each sampling approach to progressively downsample it. Specifically, the point cloud is downsampled by five steps with only 25\% points being retained in each step on a single GPU i.e. a four-fold decimation ratio. This means that there are only $\sim (1/4)^5 \times 10^3$ points left in the end. This downsampling strategy emulates the procedure used in PointNet++ \cite{qi2017pointnet++}. For each sampling approach, we sum up its time and memory consumption for comparison.
\item Group 2/3/4. The total number of points are increased towards large-scale, i.e., around $10^4, 10^5$ and $10^6$ points respectively. We use the same five sampling steps as in Group 1.
\end{itemize}
\textbf{Analysis.} Figure \ref{fig:sampling_comparison} compares the total time and memory consumption of each sampling approach to process different scales of point clouds. It can be seen that: 1) For small-scale point clouds ($\sim 10^3$), all sampling approaches tend to have similar time and memory consumption, and are unlikely to incur a heavy or limiting computation burden. 2) For large-scale point clouds ($\sim 10^6$), FPS/IDIS/GS/CRS/PGS are either extremely time-consuming or memory-costly. By contrast, random sampling has superior time and memory efficiency overall. This result clearly demonstrates that most existing networks \cite{qi2017pointnet++, li2018pointcnn, wu2018pointconv, liu2019relation, pointweb, Yang2019ModelingPC} are only able to be optimized on small blocks of point clouds primarily because they rely on the expensive sampling approaches. Motivated by this, we use the efficient random sampling strategy in our \nickname{}.
\subsection{Efficiency of \nickname{}}
\label{sec:eff_net}
\begin{figure}[t]
\centering
\includegraphics[width=\textwidth]{figs/Fig5.pdf}
\caption{Time and memory consumption of different sampling approaches. The dashed lines represent estimated values due to the limited GPU memory.}
\label{fig:sampling_comparison}
\vspace{-0.2cm}
\end{figure}
In this section, we systematically evaluate the overall efficiency of our \nickname{} on real-world large-scale point clouds for semantic segmentation. Particularly, we evaluate \nickname{} on the SemanticKITTI \cite{behley2019semantickitti} dataset, obtaining the total time consumption of our network on Sequence 08 which has 4071 scans of point clouds in total. We also evaluate the time consumption of recent representative works \cite{qi2017pointnet, qi2017pointnet++, li2018pointcnn, landrieu2018large, thomas2019kpconv} on the same dataset. For a fair comparison, we feed the same number of points (i.e., 81920) from each scan into each neural network.
In addition, we also evaluate the memory consumption of \nickname{} and the baselines. In particular, we not only report the total number of parameters of each network, but also measure the maximum number of 3D points each network can take as input in a single pass to infer per-point semantics. Note that, all experiments are conducted on the same machine with an AMD 3700X @3.6GHz CPU and an NVIDIA RTX2080Ti GPU.
\textbf{Analysis.} Table \ref{tab:efficiency} quantitatively shows the total time and memory consumption of different approaches. It can be seen that, 1) SPG \cite{landrieu2018large} has the lowest number of network parameters, but takes the longest time to process the point clouds due to the expensive geometrical partitioning and super-graph construction steps; 2) both PointNet++ \cite{qi2017pointnet++} and PointCNN \cite{li2018pointcnn} are also computationally expensive mainly because of the FPS sampling operation; 3) PointNet \cite{qi2017pointnet} and KPConv \cite{thomas2019kpconv} are unable to take extremely large-scale point clouds (e.g. $10^6$ points) in a single pass due to their memory inefficient operations. 4) Thanks to the simple random sampling together with the efficient MLP-based local feature aggregator, our \nickname{} takes the shortest time (185 seconds averaged by 4071 frames $\rightarrow$ roughly 22 FPS) to infer the semantic labels for each large-scale point cloud (up to $10^6$ points).
\begin{table}[tb]
\centering
\caption{The computation time, network parameters and maximum number of input points of different approaches for semantic segmentation on Sequence 08 of the SemanticKITTI \cite{behley2019semantickitti} dataset.}
\label{tab:efficiency}
\resizebox{\textwidth}{!}{%
\begin{tabular}{rccc}
\toprule[1.0pt]
& \begin{tabular}[c]{@{}c@{}}Total time\\ (seconds)\end{tabular} & \begin{tabular}[c]{@{}c@{}}Parameters\\ (millions)\end{tabular} & \begin{tabular}[c]{@{}c@{}}Maximum inference\\ points (millions)\end{tabular} \\
\toprule[1.0pt]
PointNet (Vanilla) \cite{qi2017pointnet} & 192 & 0.8 & 0.49 \\
PointNet++ (SSG) \cite{qi2017pointnet++} & 9831 & 0.97 & 0.98 \\
PointCNN \cite{li2018pointcnn} & 8142 & 11 & 0.05 \\
SPG \cite{landrieu2018large} & 43584 & \textbf{0.25} & - \\
KPConv \cite{thomas2019kpconv} &717 &14.9 &0.54 \\
\textbf{\nickname{} (Ours)} & \textbf{185} & 1.24 & \textbf{1.03} \\
\toprule[1.0pt]
\end{tabular}%
}
\end{table}
\subsection{Semantic Segmentation on Benchmarks}
\label{sec:sem_seg}
\begin{table*}[htb]
\centering
\caption{Quantitative results of different approaches on Semantic3D (reduced-8) \cite{Semantic3D}. Only recently published approaches are compared. Accessed on 31 March 2020.}
\label{tab:reduced-8}
\resizebox{0.9\textwidth}{!}{%
\begin{tabular}{rcccccccccc}
\toprule[1.0pt]
& mIoU (\%) & OA (\%) & man-made. & natural. & high veg. & low veg. & buildings & hard scape & scanning art. & cars \\
\toprule[1.0pt]
SnapNet\_ \cite{snapnet} & 59.1 & 88.6 & 82.0 & 77.3 & 79.7 & 22.9 & 91.1 & 18.4 & 37.3 & 64.4 \\
SEGCloud \cite{tchapmi2017segcloud} & 61.3 & 88.1 & 83.9 & 66.0 & 86.0 & 40.5 & 91.1 & 30.9 & 27.5 & 64.3 \\
RF\_MSSF \cite{RF_MSSF} & 62.7 & 90.3 & 87.6 & 80.3 & 81.8 & 36.4 & 92.2 & 24.1 & 42.6 & 56.6 \\
MSDeepVoxNet \cite{msdeepvoxnet} & 65.3 & 88.4 & 83.0 & 67.2 & 83.8 & 36.7 & 92.4 & 31.3 & 50.0 & 78.2 \\
ShellNet \cite{zhang2019shellnet} & 69.3 & 93.2 & 96.3 & 90.4 & 83.9 & 41.0 & 94.2 & 34.7 & 43.9 & 70.2 \\
GACNet \cite{GACNet} & 70.8 & 91.9 & 86.4 & 77.7 & \textbf{88.5} & \textbf{60.6} & 94.2 & 37.3 & 43.5 & 77.8 \\
SPG \cite{landrieu2018large} & 73.2 & 94.0 & \textbf{97.4} & \textbf{92.6} & 87.9 & 44.0 & 83.2 & 31.0 & 63.5 & 76.2 \\
KPConv \cite{thomas2019kpconv} & 74.6 & 92.9 & 90.9 & 82.2 & 84.2 & 47.9 & 94.9 & 40.0 & \textbf{77.3} & \textbf{79.7} \\
\textbf{\nickname{} (Ours)} & \textbf{77.4} & \textbf{94.8} & 95.6 & 91.4 & 86.6 & 51.5 & \textbf{95.7} & \textbf{51.5} & 69.8 & 76.8
\\
\bottomrule[1.0pt]
\end{tabular}%
}
\end{table*}
In this section, we evaluate the semantic segmentation of our \nickname{} on three large-scale public datasets: the outdoor Semantic3D \cite{Semantic3D} and SemanticKITTI \cite{behley2019semantickitti}, and the indoor S3DIS \cite{2D-3D-S}.
\begin{table*}[htb]
\centering
\caption{Quantitative results of different approaches on SemanticKITTI \cite{behley2019semantickitti}. Only recently published methods are compared, and all scores are obtained from the online single-scan evaluation track. Accessed on 31 March 2020.}
\label{tab:SemanticKITTI}
\resizebox{\textwidth}{!}{%
\begin{tabular}{rcccccccccccccccccccccc}
\toprule[1.0pt]
Methods & Size & \rotatebox{90}{\textbf{mIoU(\%)}} & \rotatebox{90}{Params(M)} & \rotatebox{90}{road} & \rotatebox{90}{sidewalk} & \rotatebox{90}{parking} & \rotatebox{90}{other-ground} & \rotatebox{90}{building} & \rotatebox{90}{car} & \rotatebox{90}{truck} & \rotatebox{90}{bicycle} & \rotatebox{90}{motorcycle} & \rotatebox{90}{other-vehicle} & \rotatebox{90}{vegetation} & \rotatebox{90}{trunk} & \rotatebox{90}{terrain} & \rotatebox{90}{person} & \rotatebox{90}{bicyclist} & \rotatebox{90}{motorcyclist} & \rotatebox{90}{fence} & \rotatebox{90}{pole} & \rotatebox{90}{traffic-sign} \\
\toprule[1.0pt]
PointNet \cite{qi2017pointnet} & \multirow{5}{*}{50K pts} & 14.6 & 3 & 61.6 & 35.7 & 15.8 & 1.4 & 41.4 & 46.3 & 0.1 & 1.3 & 0.3 & 0.8 & 31.0 & 4.6 & 17.6 & 0.2 & 0.2 & 0.0 & 12.9 & 2.4 & 3.7 \\
SPG \cite{landrieu2018large} & & 17.4 & \textbf{0.25} & 45.0 & 28.5 & 0.6 & 0.6 & 64.3 & 49.3 & 0.1 & 0.2 & 0.2 & 0.8 & 48.9 & 27.2 & 24.6 & 0.3 & 2.7 & 0.1 & 20.8 & 15.9 & 0.8 \\
SPLATNet \cite{su2018splatnet} & & 18.4 & 0.8 & 64.6 & 39.1 & 0.4 & 0.0 & 58.3 & 58.2 & 0.0 & 0.0 & 0.0 & 0.0 & 71.1 & 9.9 & 19.3 & 0.0 & 0.0 & 0.0 & 23.1 & 5.6 & 0.0 \\
PointNet++ \cite{qi2017pointnet++} & & 20.1 & 6 & 72.0 & 41.8 & 18.7 & 5.6 & 62.3 & 53.7 & 0.9 & 1.9 & 0.2 & 0.2 & 46.5 & 13.8 & 30.0 & 0.9 & 1.0 & 0.0 & 16.9 & 6.0 & 8.9 \\
TangentConv \cite{tangentconv} & & 40.9 & 0.4 & 83.9 & 63.9 & 33.4 & 15.4 & 83.4 & 90.8 & 15.2 & 2.7 & 16.5 & 12.1 & 79.5 & 49.3 & 58.1 & 23.0 & 28.4 & \textbf{8.1} & 49.0 & 35.8 & 28.5 \\
\toprule[1.0pt]
SqueezeSeg \cite{wu2018squeezeseg} & \multirow{5}{*}{\begin{tabular}[c]{@{}c@{}}64*2048\\ pixels\end{tabular}} & 29.5 & 1 & 85.4 & 54.3 & 26.9 & 4.5 & 57.4 & 68.8 & 3.3 & 16.0 & 4.1 & 3.6 & 60.0 & 24.3 & 53.7 & 12.9 & 13.1 & 0.9 & 29.0 & 17.5 & 24.5 \\
SqueezeSegV2 \cite{wu2019squeezesegv2} & & 39.7 & 1 & 88.6 & 67.6 & 45.8 & 17.7 & 73.7 & 81.8 & 13.4 & 18.5 & 17.9 & 14.0 & 71.8 & 35.8 & 60.2 & 20.1 & 25.1 & 3.9 & 41.1 & 20.2 & 36.3 \\
DarkNet21Seg \cite{behley2019semantickitti} & & 47.4 & 25 & 91.4 & 74.0 & 57.0 & 26.4 & 81.9 & 85.4 & 18.6 & \textbf{26.2} & 26.5 & 15.6 & 77.6 & 48.4 & 63.6 & 31.8 & 33.6 & 4.0 & 52.3 & 36.0 & 50.0 \\
DarkNet53Seg \cite{behley2019semantickitti} & & 49.9 & 50 & \textbf{91.8} & 74.6 & 64.8 & \textbf{27.9} & 84.1 & 86.4 & 25.5 & 24.5 & 32.7 & 22.6 & 78.3 & 50.1 & 64.0 & 36.2 & 33.6 & 4.7 & 55.0 & 38.9 & 52.2 \\
RangeNet53++ \cite{rangenet++} & & 52.2 & 50 & \textbf{91.8} & \textbf{75.2} & \textbf{65.0} & 27.8 & \textbf{87.4} & 91.4 & 25.7 & 25.7 & \textbf{34.4} & 23.0 & 80.5 & 55.1 & 64.6 &38.3 & 38.8 & 4.8 & \textbf{58.6} & 47.9 & \textbf{55.9} \\
\toprule[1.0pt]
\textbf{\nickname{} (Ours)} & 50K pts & \textbf{53.9} & 1.24 & 90.7 & 73.7 & 60.3 & 20.4 & 86.9 & \textbf{94.2} & \textbf{40.1} & 26.0 & 25.8 & \textbf{38.9} & \textbf{81.4} & \textbf{61.3} & \textbf{66.8} & \textbf{49.2} & \textbf{48.2} & 7.2 & 56.3 & \textbf{49.2} & 47.7 \\
\toprule[1.0pt]
\end{tabular}%
}
\end{table*}
\label{sec:result}
\begin{figure*}[hbt!]
\centering
\includegraphics[width=1\textwidth]{figs/Fig6.pdf}
\caption{Qualitative results of \nickname{} on the validation set of SemanticKITTI \cite{behley2019semantickitti}. Red circles show the failure cases.}
\label{fig:semanticKITTI}
\end{figure*}
\noindent\textbf{(1) Evaluation on Semantic3D.} The Semantic3D dataset \cite{Semantic3D} consists of 15 point clouds for training and 15 for online testing. Each point cloud has up to $10^{8}$ points, covering up to 160$\times$240$\times$30 meters in real-world 3D space. The raw 3D points belong to 8 classes and contain 3D coordinates, RGB information, and intensity. We only use the 3D coordinates and color information to train and test our \nickname{}. Mean Intersection-over-Union (mIoU) and Overall Accuracy (OA) of all classes are used as the standard metrics. For fair comparison, we only include the results of recently published strong baselines \cite{snapnet, tchapmi2017segcloud, RF_MSSF, msdeepvoxnet, zhang2019shellnet, GACNet, landrieu2018large} and the current state-of-the-art approach KPConv \cite{thomas2019kpconv}.
Table \ref{tab:reduced-8} presents the quantitative results of different approaches. \nickname{} clearly outperforms all existing methods in terms of both mIoU and OA. Notably, \nickname{} also achieves superior performance on six of the eight classes, the exceptions being \textit{low vegetation} and \textit{scanning art.}.
\noindent\textbf{(2) Evaluation on SemanticKITTI.} SemanticKITTI \cite{behley2019semantickitti} consists of 43552 densely annotated LIDAR scans belonging to 21 sequences. Each scan is a large-scale point cloud with $\sim 10^{5}$ points and spanning up to 160$\times$160$\times$20 meters in 3D space. Officially, the sequences 00$\sim$07 and 09$\sim$10 (19130 scans) are used for training, the sequence 08 (4071 scans) for validation, and the sequences 11$\sim$21 (20351 scans) for online testing. The raw 3D points only have 3D coordinates without color information. The mIoU score over 19 categories is used as the standard metric.
Table \ref{tab:SemanticKITTI} shows a quantitative comparison of our \nickname{} with two families of recent approaches, i.e., 1) point-based methods \cite{qi2017pointnet,landrieu2018large,su2018splatnet,qi2017pointnet++,tangentconv} and 2) projection-based approaches \cite{wu2018squeezeseg, wu2019squeezesegv2,behley2019semantickitti, rangenet++}, and Figure \ref{fig:semanticKITTI} shows some qualitative results of \nickname{} on the validation split. It can be seen that our \nickname{} surpasses all point-based approaches \cite{qi2017pointnet,landrieu2018large,su2018splatnet,qi2017pointnet++,tangentconv} by a large margin. We also outperform all projection-based methods \cite{wu2018squeezeseg, wu2019squeezesegv2,behley2019semantickitti, rangenet++}, although not by a significant margin, primarily because RangeNet++ \cite{rangenet++} achieves much better results on small-object categories such as \textit{traffic-sign}. However, our \nickname{} has $40\times$ fewer network parameters than RangeNet++ \cite{rangenet++} and is more computationally efficient as it does not require the costly pre/post-projection steps.
\noindent\textbf{(3) Evaluation on S3DIS.} The S3DIS dataset \cite{2D-3D-S} consists of 271 rooms belonging to 6 large areas. Each point cloud is a medium-sized single room ($\sim$ 20$\times$15$\times$5 meters) with dense 3D points. To evaluate the semantic segmentation of our \nickname{}, we use the standard 6-fold cross-validation in our experiments. The mean IoU (mIoU), mean class Accuracy (mAcc) and Overall Accuracy (OA) of the total 13 classes are compared.
As shown in Table \ref{tab:s3dis}, our \nickname{} achieves on-par or better performance than state-of-the-art methods. Note that most of these baselines \cite{qi2017pointnet++, li2018pointcnn, pointweb, zhang2019shellnet, dgcnn, chen2019lsanet} tend to use sophisticated but expensive operations or samplings to optimize the networks on small blocks (e.g., 1$\times$1 meter) of point clouds, and the relatively small rooms work in their favour, as they can easily be divided into such tiny blocks. By contrast, \nickname{} takes entire rooms as input and is able to efficiently infer per-point semantics in a single pass.
\begin{table}[thb]
\centering
\caption{Quantitative results of different approaches on the S3DIS dataset \cite{2D-3D-S} (6-fold cross-validation). Only recently published methods are included.}
\label{tab:s3dis}
\resizebox{0.8\textwidth}{!}{%
\begin{tabular}{rccc}
\toprule[1.0pt]
& OA(\%) & mAcc(\%) & mIoU(\%) \\
\toprule[1.0pt]
PointNet \cite{qi2017pointnet} & 78.6 & 66.2 & 47.6 \\
PointNet++ \cite{qi2017pointnet++} & 81.0 & 67.1 & 54.5 \\
DGCNN \cite{dgcnn} & 84.1 & - & 56.1 \\
3P-RNN \cite{3PRNN} & 86.9 & - & 56.3 \\
RSNet \cite{RSNet} & - & 66.5 & 56.5 \\
SPG \cite{landrieu2018large} & 85.5 & 73.0 & 62.1 \\
LSANet \cite{chen2019lsanet} & 86.8 & - & 62.2 \\
PointCNN \cite{li2018pointcnn} & 88.1 & 75.6 & 65.4 \\
PointWeb \cite{pointweb} & 87.3 & 76.2 & 66.7 \\
ShellNet \cite{zhang2019shellnet} & 87.1 & - & 66.8 \\
HEPIN \cite{HPEIN} & \textbf{88.2} & - & 67.8 \\
KPConv \cite{thomas2019kpconv} & - & 79.1 & \textbf{70.6} \\
\textbf{\nickname{} (Ours)} & 88.0 & \textbf{82.0} & 70.0 \\
\toprule[1.0pt]
\end{tabular}%
}
\end{table}
\vspace{-0.2cm}
\subsection{Ablation Study}
\label{sec:ablation}
Since the impact of random sampling is fully studied in Section \ref{sec:eff_sampling}, we conduct the following ablation studies for our local feature aggregation module. All ablated networks are trained on sequences 00$\sim$07 and 09$\sim$10, and tested on sequence 08 of the SemanticKITTI dataset \cite{behley2019semantickitti}.
\textbf{(1) Removing local spatial encoding (LocSE).} This unit enables each 3D point to explicitly observe its local geometry. After removing LocSE, we directly feed the local point features into the subsequent attentive pooling.
\textbf{(2$\sim$4) Replacing attentive pooling by max/mean/sum pooling.} The attentive pooling unit learns to automatically combine all local point features. By comparison, the widely used max/mean/sum poolings tend to hard-select or uniformly combine features, so their performance may be sub-optimal.
\textbf{(5) Simplifying the dilated residual block.} The dilated residual block stacks multiple LocSE units and attentive poolings, substantially dilating the receptive field of each 3D point. By simplifying this block, we use only one LocSE unit and one attentive pooling per layer, i.e., we do not chain multiple aggregation units as in our original \nickname{}.
Table \ref{tab:ablative} compares the mIoU scores of all ablated networks. From this, we can see that: 1) the greatest impact is caused by the removal of the chained spatial embedding and attentive pooling blocks. This is highlighted in Figure \ref{fig:Residual}, which shows how using two chained blocks allows information to be propagated from a wider neighbourhood, i.e., approximately $K^2$ points as opposed to just $K$. This is especially critical with random sampling, which is not guaranteed to preserve any particular set of points. 2) The removal of the local spatial encoding unit shows the next greatest impact on performance, demonstrating that this module is necessary to effectively learn local and relative geometry context. 3) Removing the attention module diminishes performance because the network can no longer effectively retain useful features. From this ablation study, we can see how the proposed neural units complement each other to attain our state-of-the-art performance.
\begin{table}[htb]
\centering
\caption{The mean IoU scores of all ablated networks based on our full \nickname{}.}
\label{tab:ablative}
\resizebox{0.8\textwidth}{!}{%
\begin{tabular}{lc}
\toprule[1.0pt]
& mIoU(\%) \\
\toprule[1.0pt]
(1) Remove local spatial encoding & 49.8 \\
(2) Replace with max-pooling & 55.2 \\
(3) Replace with mean-pooling & 53.4 \\
(4) Replace with sum-pooling & 54.3 \\
(5) Simplify dilated residual block & 48.8 \\
\textbf{(6) The Full framework (\nickname{})} & \textbf{57.1} \\
\toprule[1.0pt]
\end{tabular}%
}
\end{table}
\vspace{-0.2cm}
\section{Conclusion}
In this paper, we demonstrated that it is possible to efficiently and effectively segment large-scale point clouds by using a lightweight network architecture. In contrast to most current approaches, that rely on expensive sampling strategies, we instead use random sampling in our framework to significantly reduce the memory footprint and computational cost. A local feature aggregation module is also introduced to effectively preserve useful features from a wide neighbourhood. Extensive experiments on multiple benchmarks demonstrate the high efficiency and the state-of-the-art performance of our approach. It would be interesting to extend our framework for the end-to-end 3D instance segmentation on large-scale point clouds by drawing on the recent work \cite{3dbonet} and also for the real-time dynamic point cloud processing \cite{liu2019meteornet}.
\clearpage
\noindent\textbf{Acknowledgments:} This work was partially supported by a China Scholarship Council (CSC) scholarship. Yulan Guo was supported by the National Natural Science Foundation of China (No. 61972435), Natural Science Foundation of Guangdong Province (2019A1515011271), and Shenzhen Technology and Innovation Committee.
{\small
\bibliographystyle{ieee_fullname}
\bibliography{egbib}
}
\clearpage
\begin{appendices}
\begin{figure*}[t]
\centering
\includegraphics[width=1\textwidth]{figs/supplementary/Fig7.pdf}
\caption{The detailed architecture of our \nickname{}. $(N,D)$ represents the number of points and feature dimension respectively. FC: Fully Connected layer, LFA: Local Feature Aggregation, RS: Random Sampling, MLP: shared Multi-Layer Perceptron, US: Up-sampling, DP: Dropout.}
\label{fig:network-detailed}
\end{figure*}
\section{Details for the Evaluation of Sampling}
We provide the implementation details of different sampling approaches evaluated in Section \ref{sec:eff_sampling}. To sample $K$ points (point features) from a large-scale point cloud $\boldsymbol{P}$ with $N$ points (point features):
\begin{enumerate}
\item \noindent\textit{Farthest Point Sampling (FPS):} We follow the implementation\footnote{\url{https://github.com/charlesq34/pointnet2}} provided by PointNet++ \cite{qi2017pointnet++}, which is also widely used in \cite{li2018pointcnn, wu2018pointconv, liu2019relation, chen2019lsanet, pointweb}. In particular, FPS is implemented as an operator running on GPU.
\item \noindent\textit{Inverse Density Importance Sampling (IDIS):}
Given a point $p_i$, its density $\rho$ is approximated by calculating the summation of the distances between $p_i$ and its nearest $t$ points \cite{Groh2018flexconv}. Formally:
\begin{equation}
\rho (p_{i})= \sum_{j=1}^{t} \left | \left | p_{i}-p_{i}^{j} \right | \right |, p_{i}^{j}\in \mathcal{N}(p_{i})
\end{equation}
\noindent where $p_{i}^{j}$ denotes the coordinates (i.e., x-y-z) of the $j^{th}$ point in the neighbour set $\mathcal{N}(p_{i})$, and $t$ is set to 16. All points are then ranked according to their inverse density $\frac{1}{\rho}$, and the top $K$ points are selected.
\item \noindent\textit{Random Sampling (RS):} We implement random sampling with the Python NumPy package. Specifically, we first use the NumPy function \textit{numpy.random.choice()} to generate $K$ indices. We then gather the corresponding spatial coordinates and per-point features from the point cloud using these indices (a minimal sketch of this step, together with IDIS and CRS, is given after this list).
\item \noindent\textit{Generator-based Sampling (GS):} The implementation follows the code\footnote{\url{https://github.com/orendv/learning_to_sample}} provided by \cite{learning2sample}. We first train a ProgressiveNet \cite{learning2sample} to transform the raw point clouds into ordered point sets according to their relevance to the task. After that, the first $K$ points are kept, while the rest is discarded.
\item \noindent\textit{Continuous Relaxation based Sampling (CRS):} \textit{CRS} is implemented with the self-attended Gumbel-softmax sampling \cite{concrete}\cite{Yang2019ModelingPC}. Given a point feature set $\boldsymbol{P} \in \mathbb{R}^{N\times (d+3)}$ with 3D coordinates and per-point features, we first estimate a probability score vector $\mathbf{s} \in \mathbb{R}^{N}$ through a score function parameterized by an MLP layer, i.e., $\mathbf{s}=softmax(MLP(\boldsymbol{P}))$, which learns a categorical distribution. Then, with Gumbel noise $\mathbf{g} \in \mathbb{R}^{N}$ drawn from the distribution $Gumbel(0, 1)$, each sampled point feature vector $\mathbf{y} \in \mathbb{R}^{d+3}$ is calculated as follows:
\begin{equation}\label{eq: concrete sampling}
\mathbf{y} = \sum_{i=1}^N
\dfrac{\exp\big((\log (\mathbf{s}^{(i)})+\mathbf{g}^{(i)})/\tau\big) \, \boldsymbol{P}^{(i)}}
{\sum_{j=1}^N \exp\big((\log (\mathbf{s}^{(j)}) + \mathbf{g}^{(j)})/\tau\big)},
\end{equation}
where $\mathbf{s}^{(i)}$ and $\mathbf{g}^{(i)}$ denote the $i^{th}$ elements of the vectors $\mathbf{s}$ and $\mathbf{g}$ respectively, and $\boldsymbol{P}^{(i)}$ represents the $i^{th}$ row vector of the input matrix $\boldsymbol{P}$. $\tau > 0$ is the annealing temperature. When $\tau \rightarrow 0$, Equation~\ref{eq: concrete sampling} approaches the discrete distribution and samples each row vector in $\boldsymbol{P}$ with probability $p(\mathbf{y}=\boldsymbol{P}^{(i)})=\mathbf{s}^{(i)}$.
\item \noindent\textit{Policy Gradients based Sampling (PGS):} Given a point feature set $\boldsymbol{P} \in \mathbb{R}^{N\times (d+3)}$ with 3D coordinates and per-point features, we first predict a score $\mathbf{s}$ for each point, learnt by an MLP, i.e., $\mathbf{s}=softmax(MLP(\boldsymbol{P}))+\mathbf{\epsilon}$, where $\mathbf{\epsilon} \in \mathbb{R}^N$ is zero-mean Gaussian noise with variance $\mathbf{\Sigma}$, added for random exploration. We then sample the $K$ vectors in $\boldsymbol{P}$ with the top $K$ scores. Sampling each point/vector can be regarded as an independent action, and a sequence of such actions forms a sequential Markov Decision Process (MDP) with the following policy function $\pi$:
\begin{equation}
a_i\sim \pi(a|\boldsymbol{P}^{(i)}; \theta, \mathbf{s})
\end{equation}
where $a_i$ is the binary decision of whether to sample the $i^{th}$ vector in $\boldsymbol{P}$ and $\theta$ is the network parameter of the MLP.
Hence, to improve the sampling policy despite the non-differentiable sampling process, we apply the REINFORCE algorithm \cite{sutton2000policy} as the gradient estimator. The segmentation accuracy $R$ is used as the reward for the entire sampling process, i.e., $\mathcal{J}=R$, which is optimized with the following estimated gradients:
\begin{equation}\label{eq: reinforce}
\begin{aligned}
\dfrac{\partial \mathcal{J} }{\partial \theta} \approx \dfrac{1}{M}\sum_{m=1}^M\sum_{i=1}^{N} \dfrac{\partial}{\partial \theta}\log \pi(a_i|\boldsymbol{P}^{(i)};\theta,\mathbf{\Sigma}) \times \\
(R-b^c-b(\boldsymbol{P}^{(i)})),
\end{aligned}
\end{equation}
where $M$ is the batch size, $b^c$ and $b(\boldsymbol{P}^{(i)})$ are two control variates \cite{mnih2014neural} for alleviating the high variance problem of policy gradients.
\end{enumerate}
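For illustration, a minimal NumPy-based sketch of the RS and IDIS steps, together with the continuous relaxation of Equation~\ref{eq: concrete sampling}, is given below. It is only a sketch: $\boldsymbol{P}$ is assumed to be an $(N, d+3)$ NumPy array, the helper names (\texttt{random\_sample}, \texttt{idis\_sample}, \texttt{crs\_sample\_one}) are ours rather than part of any released implementation, and the CRS scores are replaced by a uniform placeholder instead of a trained MLP.
\begin{verbatim}
# Minimal NumPy sketches of RS, IDIS and the CRS relaxation.
# P is an (N, d+3) array of xyz coordinates + per-point features.
import numpy as np
from scipy.spatial import cKDTree

def random_sample(P, K, rng):
    # RS: K indices drawn uniformly without replacement,
    # then gather coordinates and features.
    idx = rng.choice(P.shape[0], K, replace=False)
    return P[idx]

def idis_sample(P, K, t=16):
    # IDIS: rho(p_i) = sum of distances to the t nearest
    # points; rank by 1/rho and keep the top K points.
    xyz = P[:, :3]
    dist, _ = cKDTree(xyz).query(xyz, k=t + 1)
    rho = dist[:, 1:].sum(axis=1)   # drop the point itself
    order = np.argsort(-1.0 / rho)  # descending 1/rho
    return P[order[:K]]

def crs_sample_one(P, scores, tau, rng):
    # CRS: one Gumbel-softmax sample y of the equation above;
    # scores stands for the softmax output s of the score MLP.
    g = rng.gumbel(size=scores.shape[0])
    logits = (np.log(scores) + g) / tau
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ P   # soft, weighted combination of all rows

rng = np.random.default_rng(0)
P = rng.standard_normal((10000, 3 + 8))
sub_rs = random_sample(P, 2500, rng)
sub_idis = idis_sample(P, 2500)
s = np.full(P.shape[0], 1.0 / P.shape[0])  # stand-in for MLP
y = crs_sample_one(P, s, tau=0.1, rng=rng)
\end{verbatim}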
\newpage
\section{Details of the Network Architecture}
\label{network_structure}
Figure \ref{fig:network-detailed} shows the detailed architecture of \nickname{}. The network follows the widely-used encoder-decoder architecture with skip connections. The input point cloud is first fed to a shared MLP layer to extract per-point features. Four encoding and four decoding layers are then used to learn features for each point. Finally, three fully-connected layers and a dropout layer are used to predict the semantic label of each point. The details of each part are as follows: \\
\noindent\textbf{Network Input:} The input is a large-scale point cloud with a size of $N\times d_{in}$ (the batch dimension is dropped for simplicity), where $N$ is the number of points, $d_{in}$ is the feature dimension of each input point. For both S3DIS \cite{2D-3D-S} and Semantic3D \cite{Semantic3D} datasets, each point is represented by its 3D coordinates and color information (i.e., x-y-z-R-G-B), while each point of the SemanticKITTI \cite{behley2019semantickitti} dataset is only represented by 3D coordinates.
\\
\noindent\textbf{Encoding Layers:} Four encoding layers are used in our network to progressively reduce the size of the point cloud and increase the per-point feature dimensions. Each encoding layer consists of a local feature aggregation module (Section \ref{LFA}) and a random sampling operation (Section \ref{Sub-sampling}). The point cloud is downsampled with a four-fold decimation ratio; in particular, only 25\% of the point features are retained after each layer, i.e., $(N\rightarrow \frac{N}{4}\rightarrow \frac{N}{16}\rightarrow \frac{N}{64}\rightarrow \frac{N}{256})$. Meanwhile, the per-point feature dimension is gradually increased at each layer to preserve more information, i.e., $(8\rightarrow32\rightarrow 128\rightarrow 256\rightarrow 512)$. \\
\noindent\textbf{Decoding Layers:} Four decoding layers are used after the above encoding layers. In each decoding layer, we first use the KNN algorithm to find the nearest neighboring point for each query point; the point feature set is then upsampled through nearest-neighbor interpolation. Next, the upsampled feature maps are concatenated with the intermediate feature maps produced by the encoding layers through skip connections, after which a shared MLP is applied to the concatenated feature maps.\\
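As a minimal sketch of this nearest-neighbor interpolation step (the function name and array shapes below are illustrative only, not part of our released code):
\begin{verbatim}
# Each point of the denser cloud copies the features of its
# single nearest neighbor in the subsampled cloud.
import numpy as np
from scipy.spatial import cKDTree

def nn_upsample(xyz_dense, xyz_sub, feats_sub):
    # xyz_dense: (N, 3); xyz_sub: (M, 3); feats_sub: (M, D)
    _, idx = cKDTree(xyz_sub).query(xyz_dense, k=1)
    return feats_sub[idx]   # (N, D) upsampled features
\end{verbatim}
The result is then concatenated with the corresponding encoder features and passed through a shared MLP, as described above.\\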
\noindent\textbf{Final Semantic Prediction:} The final semantic label of each point is obtained through three shared fully-connected layers ($N$, 64) $\rightarrow$ ($N$, 32) $\rightarrow$ ($N$, $n_{class}$) and a dropout layer. The dropout ratio is 0.5.\\
\noindent\textbf{Network Output:} The output of \nickname{} is the predicted semantics of all points, with a size of $ N\times n_{class}$, where $n_{class}$ is the number of classes.
\section{Additional Ablation Studies on LocSE}
In Section \ref{LFA}, we encode the relative point position based on the following equation:
\begin{equation}
\mathbf{r}_{i}^{k} = MLP\Big(p_i \oplus p_i^k \oplus (p_i-p_i^k) \oplus ||p_i-p_i^k||\Big)
\label{Eq1_sup}
\end{equation}
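For clarity, the concatenation inside Equation \ref{Eq1_sup} (i.e., the input to the shared MLP) can be sketched as follows; the function and variable names are illustrative only:
\begin{verbatim}
# Relative spatial encoding for one center point p_i and
# its K neighbors, before the shared MLP of LocSE.
import numpy as np

def locse_input(p_i, neighbors):
    # p_i: (3,); neighbors: (K, 3) coordinates p_i^k
    K = neighbors.shape[0]
    center = np.tile(p_i, (K, 1))          # p_i
    rel = center - neighbors               # p_i - p_i^k
    dist = np.linalg.norm(rel, axis=1,
                          keepdims=True)   # ||p_i - p_i^k||
    return np.concatenate(
        [center, neighbors, rel, dist], axis=1)   # (K, 10)
\end{verbatim}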
We further investigate the effects of different spatial information in our framework. In particular, we conduct the following additional ablation experiments for LocSE:
\begin{itemize}
\item 1) Encoding the coordinates of the point $p_i$ only.
\item 2) Encoding the coordinates of neighboring points $p_i^k$ only.
\item 3) Encoding the coordinates of the point $p_i$ and its neighboring points $p_i^k$.
\item 4) Encoding the coordinates of the point $p_i$, the neighboring points $p_i^k$, and Euclidean distance $||p_i-p_i^k||$.
\item 5) Encoding the coordinates of the point $p_i$, the neighboring points $p_i^k$, and the relative position $p_{i}-p_{i}^{k}$.
\end{itemize}
\begin{table}[thb]
\centering
\caption{The mIoU scores of \nickname{} when encoding different kinds of spatial information.}
\label{tab:LocSE_ablation}
\resizebox{0.9\textwidth}{!}{%
\begin{tabular}{lc}
\toprule[1.0pt]
LocSE & mIoU(\%) \\
\toprule[1.0pt]
(1) $(p_{i})$ & 45.5\\
(2) $(p_{i}^{k})$ & 47.7\\
(3) $(p_{i}, p_{i}^{k})$ & 49.1 \\
(4) $(p_{i}, p_{i}^{k}, ||p_i-p_i^k||)$ & 50.5 \\
(5) $(p_{i}, p_{i}^{k}, p_{i}-p_{i}^{k})$ & 53.6\\
(6) $(p_{i}, p_{i}^{k}, p_{i}-p_{i}^{k}, ||p_i-p_i^k||)$ (\textbf{The Full Unit}) & 54.3 \\
\toprule[1.0pt]
\end{tabular}
}
\end{table}
Table \ref{tab:LocSE_ablation} compares the mIoU scores of all ablated networks on the SemanticKITTI \cite{behley2019semantickitti} dataset. We can see that: 1) Explicitly encoding all spatial information leads to the best mIoU performance. 2) The relative position $p_{i}-p_{i}^{k}$ plays an important role in this component, primarily because the relative point position enables the network to be aware of the local geometric patterns. 3) Only encoding the point position $p_i$ or $p_i^k$ is unlikely to improve the performance, because the relative local geometric patterns are not explicitly encoded. \\
\section{Additional Ablation Studies on Dilated Residual Block}
In our \nickname{}, we stack two LocSE and Attentive Pooling units as the standard dilated residual block to gradually increase the receptive field. To further evaluate how the number of aggregation units in the dilated residual block impacts the entire network, we conduct the following two additional groups of experiments.
\begin{itemize}
\item 1) We simplify the dilated residual block by using only one LocSE unit and attentive pooling.
\item 2) We add one more LocSE unit and attentive pooling, i.e., there are three aggregation units chained together.
\end{itemize}
\begin{table}[H]
\centering
\caption{The mIoU scores of \nickname{} with different numbers of aggregation units in a residual block.}
\label{tab:num_aggregation}
\resizebox{0.8\textwidth}{!}{%
\begin{tabular}{lc}
\toprule[1.0pt]
Dilated residual block & mIoU(\%) \\
\toprule[1.0pt]
(1) one aggregation unit & 49.8 \\
(2) three aggregation units & 51.1 \\
(3) two aggregation units (\textbf{The Standard Block} ) & 54.3 \\
\toprule[1.0pt]
\end{tabular}%
}
\end{table}
Table \ref{tab:num_aggregation} shows the mIoU scores of the ablated networks on the validation split of the SemanticKITTI \cite{behley2019semantickitti} dataset. It can be seen that: 1) using only one aggregation unit in the dilated residual block leads to a significant drop in segmentation performance, due to the limited receptive field. 2) Three aggregation units per block do not improve the accuracy as expected, because the significantly enlarged receptive field and the larger number of trainable parameters make the network prone to overfitting.
\section{Visualization of Attention Scores}
To better understand the attentive pooling, it is desirable to visualize the learned attention scores. However, since the attentive pooling operates on a relatively small local point set (i.e., $K$=16), it is hard to recognize meaningful shapes from such small local regions. Alternatively, we visualize the learned attention weight matrix $W$ defined in Equation \ref{Eq2} in each layer. As shown in Figure \ref{attention_matrix}, the attention weights have large values in the first encoding layers, then gradually become smooth and stable in subsequent layers. This shows that the attentive pooling tends to choose prominent or key point features at the beginning. After the point cloud has been significantly downsampled, the attentive pooling layer tends to retain the majority of the point features.
\begin{figure}[thb]
\centering
\includegraphics[width=1\textwidth]{figs/supplementary/Fig8.pdf}
\caption{Visualization of the learned attention matrix in different layers. From top left to bottom right: 16$\times$16 attention matrix, 64$\times$64 attention matrix, 128$\times$128 attention matrix, 256$\times$256 attention matrix. The yellow color represents higher attention scores.}
\label{attention_matrix}
\end{figure}
\vspace{-0.2cm}
\section{Additional Results on Semantic3D}
\label{Add_semantic3d}
More qualitative results of \nickname{} on the Semantic3D \cite{Semantic3D} dataset (\textit{reduced-8}) are shown in Figure \ref{fig:reduced8-additional}.
\begin{figure*}[t]
\centering
\includegraphics[width=0.47\textwidth]{figs/supplementary/reduced8-market-1.jpg}
\includegraphics[width=0.47\textwidth]{figs/supplementary/reduced8-market-2.jpg}
\vspace{4pt}
\vspace{4pt}
\includegraphics[width=0.47\textwidth]{figs/supplementary/reduced8-sg27-1.jpg}
\includegraphics[width=0.47\textwidth]{figs/supplementary/reduced8-sg27-2.jpg}
\vspace{4pt}
\vspace{4pt}
\includegraphics[width=0.47\textwidth]{figs/supplementary/reduced8-sg28-1.jpg}
\includegraphics[width=0.47\textwidth]{figs/supplementary/reduced8-sg28-2.jpg}
\vspace{4pt}
\vspace{4pt}
\includegraphics[width=0.47\textwidth]{figs/supplementary/reduced8-st-1.jpg}
\includegraphics[width=0.47\textwidth]{figs/supplementary/reduced8-st-2.jpg}
\vspace{8pt}
\includegraphics[trim={0 1.5cm 0 0},clip,width=\textwidth]{figs/supplementary/legend-semantic3d-horiz.pdf}
\caption{Qualitative results of \nickname{} on the \textit{reduced-8} split of Semantic3D. From left to right: full RGB colored point clouds, predicted semantic labels of full point clouds, detailed view of colored point clouds, detailed view of predicted semantic labels. Note that the ground truth of the test set is not publicly available.}
\label{fig:reduced8-additional}
\end{figure*}
\section{Additional Results on SemanticKITTI}
\begin{figure*}[t]
\centering
\includegraphics[width=1\textwidth]{figs/supplementary/Fig10.pdf}
\caption{Qualitative results of \nickname{} on the validation split of SemanticKITTI \cite{behley2019semantickitti}. Red boxes show the failure cases.}
\label{KITTI_supplementary}
\end{figure*}
Figure \ref{KITTI_supplementary} shows more qualitative results of our \nickname{} on the validation set of SemanticKITTI \cite{behley2019semantickitti}. The red boxes showcase the failure cases. It can be seen that points belonging to \textit{other-vehicle} are likely to be misclassified as \textit{car}, mainly because, without color information, the partial point clouds of these two similar classes are extremely difficult to distinguish. In addition, our approach tends to fail on several minority classes such as \textit{bicycle}, \textit{motorcycle}, \textit{bicyclist} and \textit{motorcyclist}, due to the extremely imbalanced point distribution in the dataset. For example, the number of points for \textit{vegetation} is around 7,000 times that for \textit{motorcyclist}.
\section{Additional Results on S3DIS}
\label{Add_S3DIS}
We report the detailed 6-fold cross validation results of our \nickname{} on S3DIS \cite{2D-3D-S} in Table \ref{tab:S3DIS}. Figure \ref{fig:s3dis-additional} shows more qualitative results of our approach.
\begin{figure*}[t]
\centering
\includegraphics[width=0.7\textwidth]{figs/supplementary/area1.jpg}
\includegraphics[width=0.7\textwidth]{figs/supplementary/area2.jpg}
\includegraphics[width=0.7\textwidth]{figs/supplementary/area3.jpg}
\includegraphics[width=0.7\textwidth]{figs/supplementary/area4.jpg}
\includegraphics[width=0.7\textwidth]{figs/supplementary/area5.jpg}
\includegraphics[width=0.7\textwidth]{figs/supplementary/area6.jpg}
\includegraphics[width=\textwidth]{figs/supplementary/legend-s3dis-horiz.pdf}
\caption{Semantic segmentation results of our \nickname{} on the complete point clouds of Areas 1-6 in S3DIS. Left: full RGB input cloud; middle: predicted labels; right: ground truth.}
\label{fig:s3dis-additional}
\end{figure*}
\begin{table*}[th]
\centering
\caption{Quantitative results of different approaches on S3DIS \cite{2D-3D-S} (6-fold cross-validation). Overall Accuracy (OA, \%), mean class Accuracy (mAcc, \%), mean IoU (mIoU, \%), and per-class IoU (\%) are reported.}
\label{tab:S3DIS}
\resizebox{\textwidth}{!}{%
\begin{tabular}{rcccccccccccccccc}
\toprule[1.0pt]
& OA(\%) & mAcc(\%)& mIoU(\%) & ceil. & floor & wall & beam & col. & wind. & door & table & chair & sofa & book. & board & clut. \\
\toprule[1.0pt]
PointNet \cite{qi2017pointnet} & 78.6 & 66.2 & 47.6 & 88.0 & 88.7 & 69.3 & 42.4 & 23.1 & 47.5 & 51.6 & 54.1 & 42.0 & 9.6 & 38.2 & 29.4 & 35.2 \\
RSNet \cite{RSNet} & - & 66.5 & 56.5 & 92.5 & 92.8 & 78.6 & 32.8 & 34.4 & 51.6 & 68.1 & 59.7 & 60.1 & 16.4 & 50.2 & 44.9 & 52.0 \\
3P-RNN \cite{3PRNN} & 86.9 & - & 56.3 & 92.9 & 93.8 & 73.1 & 42.5 & 25.9 & 47.6 & 59.2 & 60.4 & 66.7 & 24.8 & 57.0 & 36.7 & 51.6 \\
SPG \cite{landrieu2018large}& 86.4 & 73.0 & 62.1 & 89.9 & 95.1 & 76.4 & 62.8 & 47.1 & 55.3 & 68.4 & \textbf{73.5} & 69.2 & 63.2 & 45.9 & 8.7 & 52.9 \\
PointCNN \cite{li2018pointcnn} & \textbf{88.1} & 75.6 & 65.4 & \textbf{94.8} & \textbf{97.3} & 75.8 & 63.3 & 51.7 & 58.4 & 57.2 & 71.6 & 69.1 & 39.1 & 61.2 & 52.2 & 58.6 \\
PointWeb \cite{pointweb} & 87.3 & 76.2 & 66.7 & 93.5 & 94.2 & 80.8 & 52.4 & 41.3 & 64.9 & 68.1 & 71.4 & 67.1 & 50.3 & 62.7 & 62.2 & 58.5 \\
ShellNet \cite{zhang2019shellnet} & 87.1 & - & 66.8 & 90.2 & 93.6 & 79.9 & 60.4 & 44.1 & 64.9 & 52.9 & 71.6 & \textbf{84.7} & 53.8 & 64.6 & 48.6 & 59.4 \\
KPConv \cite{thomas2019kpconv} &- & 79.1 &\textbf{70.6} &93.6 &92.4 & \textbf{83.1} & \textbf{63.9} & \textbf{54.3} & \textbf{66.1} & \textbf{76.6} &57.8 &64.0 & \textbf{69.3} & \textbf{74.9} &61.3 & \textbf{60.3} \\
\textbf{\nickname{} (Ours)} & 88.0 & \textbf{82.0} & 70.0 & 93.1 & 96.1 & 80.6 & 62.4 & 48.0 & 64.4 & 69.4 & 69.4 & 76.4 & 60.0 & 64.2 & \textbf{65.9} & 60.1 \\
\bottomrule[1.0pt]\end{tabular}%
}
\end{table*}
\section{Video Illustration}
We provide a video to show qualitative results of our \nickname{} on both indoor and outdoor datasets, which can be viewed at \url{https://www.youtube.com/watch?v=Ar3eY_lwzMk&t=9s}.
\end{appendices}
\end{document}


@ -15,6 +15,8 @@ The report will detail all aspects of the project including an introduction, bac
- Detailed, well- justified methods with thorough legal, social, ethical and professional considerations
- Comprehensive techniques with clear discussion of challenges and solutions
- Detailed techniques with strong interpretation of findings
- Do not overuse bullet points. Use paragraphs instead.
- Use LaTeX format.
## Detail Specification
@ -133,3 +135,31 @@ max_epoch = 100 # maximum epoch during training
learning_rate = 1e-2 # initial learning rate
lr_decays = {i: 0.95 for i in range(0, 500)} # decay rate of learning rate
```
## Result
The results of the pilot study are shown below:
<!-- 70.29 | 60.12 74.53 74.21 82.48 73.62 83.03 89.68 79.81 91.93 95.22 61.86 0.00 47.31 -->
mean IoU: 70.29
IoU for each label across the test set:
```text
Label 0: 60.12
Label 1: 74.53
Label 2: 74.21
Label 3: 82.48
Label 4: 73.62
Label 5: 83.03
Label 6: 89.68
Label 7: 79.81
Label 8: 91.93
Label 9: 95.22
Label 10: 61.86
Label 11: 0.00
Label 12: 47.31
```
Average accuracy: 0.8886

.gitignore vendored

@ -9,3 +9,7 @@
*.bbl
*.blg
notebooks
*.ipynb

.vscode/settings.json vendored

@ -1,3 +1,13 @@
{
"cSpell.words": ["downsample", "downsampled"]
"cSpell.words": [
"Conv",
"downsample",
"downsampled",
"downsampling",
"explainability",
"hyperparameters",
"reweighting",
"undersampled",
"voxelization"
]
}



@ -1,13 +1,17 @@
\documentclass[11pt,twocolumn]{article}
\usepackage[utf8]{inputenc}
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{tikz}
\usetikzlibrary{positioning}
\usepackage{layout}
\usepackage{graphicx}
\usepackage[left=2cm,right=2cm,top=2.5cm,bottom=2.5cm]{geometry}
\setlength{\columnsep}{0.5in}
\title{CPT401 Assessment 2 Template}
\title{Detection of Foreign Objects on Railway Tracks: A Pilot Study with RandLA-Net}
\author{Hanwen Yu}
\date{April 2025}
@ -20,45 +24,282 @@
\maketitle
\abstract{This is the placeholder for abstract.}
\abstract{This pilot study investigates the feasibility of using LiDAR sensors coupled with 3D point cloud segmentation for detecting foreign objects on railway tracks. We establish a baseline performance using the RandLA-Net architecture without modifications on a dataset comprising 1031 PLY files with over 248 million points across 13 classes. The purpose of this pilot is to evaluate the baseline model's ability to identify rare foreign objects (boxes), which constitute only about 0.001\% of the points in our dataset. Our methodology employs random sampling at a 1/4 rate to manage computational efficiency while preserving important features through RandLA-Net's Local Feature Aggregation module. The study evaluates detection accuracy using precision and mIoU metrics, as well as computational efficiency metrics including inference time and memory consumption. This work serves as a performance benchmark for future research exploring more sophisticated architectures with attention mechanisms for improved foreign object detection on railway infrastructure.}
\section{Introduction}
Railway safety represents a critical infrastructure concern with profound implications for both public safety and economic stability. Foreign objects on railway tracks pose a significant threat that can lead to derailments, infrastructure damage, and potentially catastrophic accidents resulting in loss of life and substantial financial consequences. Traditional detection methods rely heavily on manual inspections by maintenance crews, which are inherently limited by human factors including fatigue, attention span, and the sheer scale of railway networks that need monitoring.
The report will be a 4-8 page paper prepared in LaTeX. You will submit electronic and printed copies of the paper with additional electronic appendices including all the data generated by the experiments, as well as copies of the presentation material and any other documents generated throughout the project. You should use Bitex for citations \cite{SHNEIDERMAN1996The} (see the example refs.bib file).
\section{Background}
\begin{figure}[h]
\centering
\includegraphics[width=\columnwidth]{fig/example.jpg}
\caption{Example of a railway environment with a foreign object present on the track. LiDAR point clouds capture the 3D structure of both the infrastructure and potential hazards.}\label{fig:example}
\end{figure}
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum
The advancement of sensor technologies, particularly Light Detection and Ranging (LiDAR), offers promising solutions for automated detection systems. LiDAR sensors provide high-precision 3D point cloud data capable of capturing the geometric structure of railway environments with millimeter-level accuracy. However, processing and analyzing these vast, unstructured datasets presents significant computational challenges, especially when trying to identify small, irregular objects amidst complex railway geometries.
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum
\begin{figure*}[t]
\centering
\begin{tikzpicture}[
block/.style={rectangle, draw, fill=blue!20, text width=3.8cm, text centered, rounded corners, minimum height=1.5cm},
subblock/.style={rectangle, draw, fill=blue!10, text width=3.5cm, text centered, minimum height=0.8cm},
arrow/.style={thick,->,>=stealth},
node distance=0.5cm
]
% Main blocks
\node[block] (acquisition) {1. \textbf{Data Acquisition} };
\node[block, below=of acquisition] (preprocessing) {2. \textbf{Preprocessing}};
\node[block, right=of preprocessing] (feature) {3. \textbf{Feature Extraction}};
\node[block, right=of feature] (segmentation) {4. \textbf{Segmentation}};
\node[block, above=of segmentation] (postprocessing) {5. \textbf{Post-processing}};
% Sub-blocks for preprocessing
\node[subblock, below right=1.9cm and -1.9cm of preprocessing.north] (p1) {Point cloud normalization};
\node[subblock, below=0.2cm of p1] (p2) {Subsampling (e.g., random sampling)};
\node[subblock, below=0.2cm of p2] (p3) {Data augmentation };
% Sub-blocks for feature extraction
\node[subblock, below right=1.9cm and -1.9cm of feature.north] (f1) {Local geometric encoding (e.g., RandLA-Net's LocSE)};
\node[subblock, below=0.2cm of f1] (f2) {Context-aware feature learning};
% Sub-blocks for segmentation
\node[subblock, below right=1.9cm and -1.9cm of segmentation.north] (s1) {Point-wise classification with class-balanced loss functions};
\node[subblock, below=0.2cm of s1] (s2) {Confidence scoring for detection reliability};
% Arrows connecting main blocks
\draw[arrow] (acquisition) -- (preprocessing);
\draw[arrow] (preprocessing) -- (feature);
\draw[arrow] (feature) -- (segmentation);
\draw[arrow] (segmentation) -- (postprocessing);
\end{tikzpicture}
\caption{Universal pipeline for 3D point cloud segmentation with emphasis on elements critical for rare object detection.}\label{fig:segmentation_pipeline}
\end{figure*}
Recent developments in deep learning approaches for 3D point cloud segmentation, such as PointNet\cite{qiPointNetDeepLearning2017}, and RandLA-Net\cite{huRandLANetEfficientSemantic2020}, have demonstrated remarkable capabilities in processing unstructured point cloud data. These architectures can potentially transform railway safety monitoring by enabling real-time, accurate detection of foreign objects on tracks.
This pilot study aims to establish a baseline performance benchmark using RandLA-Net for foreign object detection on railway tracks. We specifically address the challenge of detecting rare objects (boxes) that constitute merely about 0.001\% of our dataset—a scenario that mirrors real-world conditions where foreign objects represent an extremely small portion of the railway environment. Through this pilot, we seek to evaluate:
\begin{itemize}
\item The feasibility of using unmodified RandLA-Net architecture for detecting small, rare objects in complex railway environments
% \item The computational efficiency of the approach, including processing time and memory requirements
\item The detection accuracy metrics that will inform future research directions
\end{itemize}
The insights gained from this pilot study will guide the development of more sophisticated architectures incorporating attention mechanisms and specialized data augmentation techniques to enhance detection accuracy.
\section{Problem Statement}
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum
This pilot study addresses the challenge of detecting foreign objects on railway tracks using 3D point cloud segmentation. The problem can be formally defined as follows:
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum
Given a point cloud $P = \{p_1, p_2, \ldots, p_n\}$ where each point $p_i \in \mathbb{R}^3$ represents a 3D coordinate in the railway environment, our task is to assign each point a semantic label $l_i \in \{0, 1, \ldots, C-1\}$ where $C = 13$ represents our predefined classes. The function $f: P \rightarrow L$ maps the input point cloud to a set of labels $L = \{l_1, l_2, \ldots, l_n\}$.
Semantic segmentation—where each individual point in the cloud is classified—represents the most appropriate approach for several reasons. Segmentation maintains the precise spatial relationships between objects and their environment, which is essential for distinguishing legitimate infrastructure from hazardous foreign objects. The exact shape and positioning of objects relative to critical infrastructure components like rails can be precisely captured, enabling more accurate hazard assessment. Segmentation allows simultaneous classification of all environmental elements (tracks, ground, tunnels, etc.), providing comprehensive scene understanding rather than binary foreign/non-foreign detection.
Based on our findings and the broader literature, we can identify a universal pipeline for 3D point cloud segmentation that highlights where adjustments are needed for rare object detection (see Figure~\ref{fig:segmentation_pipeline}).
\section{Related Work}
The semantic segmentation of 3D point cloud data has seen significant advancements in recent years. PointNet\cite{qiPointNetDeepLearning2017} pioneered direct processing of raw point clouds by using shared MLPs and symmetric functions to achieve permutation invariance, though it struggled to capture local geometric structures. To address this limitation, hierarchical architectures emerged, with KPConv\cite{thomasKPConvFlexibleDeformable2019} introducing kernel point convolutions that flexibly apply weights in Euclidean space.
Point cloud processing for large-scale environments presents unique computational challenges. The increasing volume of LiDAR data in outdoor environments, particularly for applications like autonomous driving and railway monitoring, has sparked interest in methods that balance accuracy with efficiency. PolarNet\cite{zhangPolarNetImprovedGrid2020} introduced an improved grid representation specifically for online LiDAR segmentation that balances point distribution in polar coordinates. Meanwhile, attention-based approaches like Point Transformer\cite{zhaoPointTransformer2021} have demonstrated state-of-the-art performance by adapting self-attention mechanisms to point cloud data, with subsequent iterations\cite{wuPointTransformerV22022, wuPointTransformerV32024} focusing on improving efficiency and scalability.
For extremely large point clouds, RandLA-Net\cite{huRandLANetEfficientSemantic2020} represents a significant breakthrough by employing random point sampling instead of computationally expensive alternatives like farthest point sampling. To compensate for the potentially lost information from random sampling, it introduces a Local Feature Aggregation module that preserves geometric details while maintaining computational efficiency. Recent research has also explored data augmentation strategies\cite{parkRethinkingDataAugmentation2024} to improve model robustness in adverse conditions and label-efficient approaches\cite{xieAnnotatorGenericActive2023} to reduce annotation costs.
Our work builds on RandLA-Net's efficient architecture as a baseline for evaluating foreign object detection capability in railway environments, particularly focusing on its performance with extremely imbalanced classes where target objects constitute a tiny fraction of the overall point cloud.
\section{Methodology}
\subsection{Data Acquisition and Characteristics}
For this study, we deployed multiple LiDAR sensors along selected railway segments to capture comprehensive 3D point cloud data. The resulting dataset comprises 1031 PLY files containing over 248 million points in total, with an average of 240,742.8 points per file. The file size varies considerably, with the smallest containing 50,048 points and the largest having 952,476 points, reflecting the natural variability in scene complexity across different railway segments.
\begin{table*}[t]
\centering
\caption{Distribution of semantic classes in the railway LiDAR dataset}\label{tab:label_distribution}
\begin{tabular}{clrr}
\hline
Label & Class Name & Point Count & Percentage \\
\hline
0 & Track & 16,653,029 & 6.71\% \\
1 & Track Surface & 39,975,480 & 16.11\% \\
2 & Ditch & 7,937,154 & 3.20\% \\
3 & Masts & 4,596,199 & 1.85\% \\
4 & Cable & 2,562,683 & 1.03\% \\
5 & Tunnel & 31,412,582 & 12.66\% \\
6 & Ground & 73,861,934 & 29.76\% \\
7 & Fence & 7,834,499 & 3.16\% \\
8 & Mountain & 51,685,366 & 20.82\% \\
9 & Train & 9,047,963 & 3.65\% \\
10 & Human & 275,077 & 0.11\% \\
11 & Box (foreign object) & 3,080 & 0.001\% \\
12 & Others & 2,360,810 & 0.95\% \\
\hline
\textbf{Total} & & \textbf{248,205,859} & \textbf{100\%} \\
\hline
\end{tabular}
\end{table*}
The dataset was manually annotated with 13 semantic classes representing common elements in railway environments. These classes include infrastructure components (track, ditch, masts, cable, tunnel, fence), environmental features (ground, mountain), dynamic objects (train, human), and critically, our target class—boxes—which represents foreign objects on the tracks. The ``others'' class serves as a background category for points that do not fit into the defined classes.
Table~\ref{tab:label_distribution} shows the distribution of semantic classes in our dataset. The class distribution exhibits extreme imbalance, which mirrors the real-world scenario where foreign objects are rare occurrences. Specifically, the target ``box'' class (label 11) constitutes merely 3,080 points, or approximately 0.001\% of the entire dataset. In contrast, common environmental elements like ground (label 6) and mountain (label 8) make up 29.76\% and 20.82\% of the points respectively. This severe imbalance presents a significant challenge for the detection task, as the model must learn to identify extremely rare objects without being overwhelmed by the dominant classes.
The dataset was split into a training set of 858 files (approximately 83\%) and a test set of 172 files (approximately 17\%). This split was performed randomly to ensure that the model would be evaluated on a representative and unbiased sample of the data, covering the full range of environmental conditions and object distributions present in the dataset. After this split, 18 files in the training set contain the foreign object class (label 11), while only one file in the test set does.
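The per-class counts and percentages in Table~\ref{tab:label_distribution} can be reproduced with a few lines of NumPy, assuming the per-point labels of all files have been concatenated into a single integer array (the file name below is hypothetical):
\begin{verbatim}
import numpy as np

# hypothetical dump of all concatenated labels
labels = np.load("all_labels.npy")
counts = np.bincount(labels, minlength=13)
percent = 100.0 * counts / counts.sum()
for c in range(13):
    print(f"Label {c}: {counts[c]} "
          f"({percent[c]:.3f}%)")
\end{verbatim}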
\subsection{Dataset Preprocessing}
The primary challenge of data preprocessing was managing the computational burden of the massive point clouds while preserving the geometric information critical for segmentation, particularly for the rare foreign object points.
We utilized random sampling as our primary downsampling method to reduce the point cloud density, following the approach employed in RandLA-Net. In our implementation, each point cloud was downsampled at a ratio of 1/4, meaning we retained 25\% of the original points for training and testing. This approach offers significant computational advantages over more complex sampling techniques like farthest point sampling, which has a computational complexity of $O(n^2)$ compared to the $O(n)$ complexity of random sampling.
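A minimal sketch of this 1/4 random downsampling, assuming the coordinates and labels of a single cloud are already loaded as NumPy arrays (the function name is ours):
\begin{verbatim}
import numpy as np

def downsample(points, labels,
               ratio=0.25, seed=0):
    # points: (N, 3) xyz; labels: (N,) class ids
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    idx = rng.choice(n, int(n * ratio),
                     replace=False)
    return points[idx], labels[idx]
\end{verbatim}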
Although random sampling risks discarding informative points by chance, particularly from the already sparse foreign object class, RandLA-Net compensates for this potential loss through its Local Feature Aggregation (LFA) module. This module progressively enlarges the receptive field of each point, effectively capturing local geometric patterns despite the aggressive downsampling.
\textbf{No additional data augmentation techniques were applied} during this pilot study, as our focus was on establishing a baseline performance using the standard RandLA-Net architecture without modifications. This approach allows us to isolate the inherent capabilities of the model on our railway dataset before exploring potential improvements through data augmentation or architectural changes in future work.
\subsection{Implementation Details}
% add a image
\begin{figure*}[h]
\centering
\includegraphics[width=17cm]{fig/Fig7.pdf}
\caption{Architecture of RandLA-Net. The network consists of an encoder-decoder structure with local feature aggregation modules to capture local geometric patterns.}\label{fig:architecture}
\end{figure*}
For this pilot study, we implemented RandLA-Net by following the architecture described in the original paper\cite{huRandLANetEfficientSemantic2020}. The network follows an encoder-decoder design with skip connections, consisting of four encoding layers and four decoding layers. Each encoding layer contains a local feature aggregation module followed by random sampling, progressively reducing point density while increasing feature dimensions to preserve important information. The overall architecture is shown in Figure~\ref{fig:architecture}.
% The Local Feature Aggregation (LFA) module consists of three essential units: (1) Local Spatial Encoding (LocSE), which explicitly encodes the relative positions of neighboring points to capture local geometric patterns; (2) Attentive Pooling, which employs an attention mechanism to automatically weight and combine neighboring features, focusing on the most informative ones; and (3) Dilated Residual Blocks, which stack multiple LocSE and Attentive Pooling units with skip connections to significantly enlarge the receptive field of each point. This architecture enables each point to effectively observe up to $K^2$ neighboring points after two stacked units (where $K$ is the number of nearest neighbors), allowing the network to capture complex local structures even when many points are dropped during random sampling.
The network hyperparameters were kept at their default values as specified in the original implementation. In particular, the number of nearest neighbors $K$ was set to 16 for the K-nearest neighbors (KNN) algorithm used in local spatial encoding. For each dilated residual block, we followed the original design of stacking two sets of local spatial encoding (LocSE) units and attentive pooling units, which provides an effective balance between accuracy and computational efficiency by expanding the receptive field to cover approximately $K^2$ points.
The point cloud sampling strategy employed a simple random sampling approach with a four-fold decimation ratio at each layer, meaning only 25\% of points were retained after each encoding layer. This aggressive downsampling is compensated for by the Local Feature Aggregation module.
For model training, we used the Adam optimizer with an initial learning rate of 0.01, decreasing by 5\% after each epoch. Given the extreme class imbalance in our dataset, we did not employ any specific reweighting strategy for the loss function in this baseline study, as our focus was on establishing fundamental performance metrics without modifications to the architecture or training process.
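The learning-rate schedule described above is a simple exponential decay; a small sketch (the function name is illustrative):
\begin{verbatim}
# lr after each epoch: 0.01 * 0.95 ** epoch
def lr_at_epoch(epoch, base_lr=1e-2,
                decay=0.95):
    return base_lr * decay ** epoch

# epoch 0 -> 0.01, epoch 10 -> ~0.006,
# epoch 99 -> ~6e-5
\end{verbatim}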
During inference, the entire point cloud was processed directly without any pre- or post-processing steps such as block partitioning or voxelization, demonstrating the ability of RandLA-Net to handle large-scale point clouds efficiently. All experiments were conducted on an NVIDIA RTX 4090 GPU with 24GB of memory.
\subsection{Evaluation Metrics}
To evaluate the performance of RandLA-Net on our railway foreign object detection task, we employed two primary metrics that are particularly relevant for semantic segmentation with extreme class imbalance:
\textbf{Mean Intersection over Union (mIoU):} This is our primary evaluation metric for semantic segmentation accuracy. For each class $c$, the IoU is calculated as:
\begin{equation}
\text{IoU}_c = \frac{\text{TP}_c}{\text{TP}_c + \text{FP}_c + \text{FN}_c}
\end{equation}
where $\text{TP}_c$, $\text{FP}_c$, and $\text{FN}_c$ represent true positives, false positives, and false negatives for class $c$, respectively. The mIoU is then calculated by averaging the IoU values across all classes:
\begin{equation}
\text{mIoU} = \frac{1}{C} \sum_{c=1}^{C} \text{IoU}_c
\end{equation}
where $C$ is the number of classes (13 in our case). This metric is particularly valuable for our task as it treats all classes equally regardless of their frequency in the dataset, giving appropriate weight to the rare foreign object class.
\textbf{Precision:} To specifically evaluate the model's ability to detect foreign objects (boxes), we also calculated precision for this class. Precision measures the proportion of correctly identified foreign object points among all points classified as foreign objects:
\begin{equation}
\text{Precision}_{box} = \frac{\text{TP}_{box}}{\text{TP}_{box} + \text{FP}_{box}}
\end{equation}
This metric is crucial for railway safety applications, as false positives could lead to unnecessary system interventions and operational disruptions. High precision indicates that when the model identifies a point as belonging to a foreign object, it is likely correct.
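
Both metrics can be computed from a single confusion matrix. The following NumPy sketch illustrates the computation on randomly generated labels (the real evaluation uses the per-point ground truth and predictions of the test set; label 11 denotes the foreign object class):

\begin{verbatim}
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes=13):
    # Rows index the ground-truth class, columns the prediction.
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

def per_class_iou(cm):
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    return tp / np.maximum(tp + fp + fn, 1)   # guard empty classes

BOX = 11                                      # foreign object label
y_true = np.random.randint(0, 13, size=100000)
y_pred = np.random.randint(0, 13, size=100000)

cm = confusion_matrix(y_true, y_pred)
iou = per_class_iou(cm)
precision_box = cm[BOX, BOX] / max(cm[:, BOX].sum(), 1)
print("mIoU:", iou.mean(), "box precision:", precision_box)
\end{verbatim}
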
In addition to these primary metrics, we also monitored computational efficiency through:
\begin{itemize}
\item \textbf{Inference Time:} The average processing time per point cloud, measured in seconds.
\item \textbf{Memory Consumption:} The peak GPU memory usage during both training and inference.
\end{itemize}
These efficiency metrics are important for assessing the practical deployability of the system in real-world railway monitoring scenarios where real-time processing is essential.
\section{Results and Analysis}
Our evaluation of RandLA-Net's performance on railway point cloud segmentation with a focus on foreign object detection yielded informative baseline results. Table~\ref{tab:segmentation_results} presents the IoU values achieved for each semantic class in our test set, along with the overall mean IoU.
The overall mean IoU across all classes was 70.29\%, which indicates generally good performance for most classes. However, the IoU for our target class—``Box'' (foreign object)—was 0.00\%, highlighting a critical limitation of the baseline model in detecting extremely rare objects. This poor performance for the foreign object class is directly attributable to the extreme class imbalance, with foreign objects constituting only 0.001\% of the dataset.
In terms of average segmentation accuracy across all points, RandLA-Net achieved 88.86\%, suggesting that the model performed well on the majority classes that dominate the point distribution. However, this metric masks the inability to detect the rare foreign object class, demonstrating why accuracy alone is insufficient for evaluating models on imbalanced datasets.
\begin{table}[t]
\centering
\caption{IoU results for each class achieved by RandLA-Net on the railway dataset test set}\label{tab:segmentation_results}
\begin{tabular}{clc}
\hline
Label & Class Name & IoU (\%) \\
\hline
0 & Track & 60.12 \\
1 & Track Surface & 74.53 \\
2 & Ditch & 74.21 \\
3 & Masts & 82.48 \\
4 & Cable & 73.62 \\
5 & Tunnel & 83.03 \\
6 & Ground & 89.68 \\
7 & Fence & 79.81 \\
8 & Mountain & 91.93 \\
9 & Train & 95.22 \\
10 & Human & 61.86 \\
11 & Box (foreign object) & 0.00 \\
12 & Others & 47.31 \\
\hline
\multicolumn{2}{l}{Mean IoU} & 70.29 \\
\hline
\end{tabular}
\end{table}
For computational efficiency, the model processed the entire test set of 172 point cloud files in approximately one minute on our NVIDIA RTX 4090 GPU, averaging 0.4 seconds per point cloud. Peak GPU memory consumption was 7.8~GB during inference. These efficiency metrics demonstrate that RandLA-Net's computational performance is suitable for potential real-time applications in railway monitoring, despite the detection limitations.
Figure~\ref{fig:segmentation_visualization} shows qualitative results of the segmentation performance. As visible in the visualization, RandLA-Net accurately segments major classes such as tracks, ground, and tunnels, while the foreign objects are not identified, consistent with their zero IoU.
These results establish a clear baseline that highlights both the strengths of RandLA-Net in efficiently processing large-scale point clouds and its limitations in handling extreme class imbalance without specialized techniques. Notably, this confirms our hypothesis that a basic implementation of RandLA-Net without modifications or class-balancing strategies is insufficient for the safety-critical task of foreign object detection on railway tracks.
\begin{figure}[h]
\centering
\includegraphics[width=\columnwidth]{fig/visualization.jpg}
\caption{Visualization of RandLA-Net segmentation results on a sample from the test set. Left: raw point cloud with RGB coloring. Right: segmentation results with different colors representing different semantic classes. Foreign objects (red circles) were not correctly identified.}\label{fig:segmentation_visualization}
\end{figure}
\section{Discussion}
The complete failure to detect foreign objects in this pilot study has significant implications for railway safety systems. In safety-critical railway operations, a single undetected foreign object can cause derailments with potentially catastrophic consequences. Therefore, despite the good overall segmentation accuracy, a system that fails to detect the most critical objects is unsuitable for deployment. This underscores the need for specialized architectures designed specifically for the detection of rare but critically important objects in railway environments.
\section{Future Work and Conclusion}
Based on our pilot study, several promising research directions emerge for improving foreign object detection on railway tracks:
\begin{itemize}
\item Developing class-balanced sampling strategies to ensure adequate representation of rare objects
\item Adapting attention mechanisms to give greater weight to points belonging to smaller objects
\item Employing specialized loss functions for handling extreme class imbalance (a class-weighted loss is sketched after this list)
\item Incorporating domain-specific knowledge about railway environments
\end{itemize}
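
As a concrete illustration of the third direction, one common remedy is to weight the cross-entropy loss by inverse class frequency. The following PyTorch sketch is a hedged example rather than a validated recipe; the per-class counts below are hypothetical placeholders, not statistics of our dataset:

\begin{verbatim}
import torch

# Hypothetical per-point counts for the 13 classes; class 11 (box) is rare.
counts = torch.tensor([5e6, 4e6, 3e6, 1e6, 8e5, 2e6, 9e6,
                       1e6, 7e6, 2e6, 3e5, 1e3, 1e6])
weights = counts.sum() / (len(counts) * counts)   # inverse-frequency weights
criterion = torch.nn.CrossEntropyLoss(weight=weights.float())

logits = torch.randn(4096, 13)                    # per-point class scores
labels = torch.randint(0, 13, (4096,))            # per-point ground truth
loss = criterion(logits, labels)
print(loss.item())
\end{verbatim}
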
Our baseline evaluation demonstrates that while RandLA-Net offers impressive computational efficiency for large-scale point cloud processing, its unmodified architecture is fundamentally unsuitable for the safety-critical task of detecting rare foreign objects on railway tracks. Future work should focus on addressing the identified limitations while maintaining the computational advantages that make real-time processing feasible in railway monitoring applications.
\section{Legal and Ethical Considerations}
The deployment of automated detection systems in railway infrastructure raises several important considerations. From a legal perspective, the poor detection performance highlights concerns about reliability and accountability—determining liability in failure-induced accidents becomes complex. The lack of transparency in deep neural networks raises ethical concerns about explainability, particularly important in safety-critical systems where operators and regulators need to understand system decisions.
Data privacy is another concern, as LiDAR data collection along railway corridors may capture information beyond the tracks, including people and private property. Additionally, there are currently no standardized testing or certification protocols for AI-based railway safety systems, and our findings suggest that conventional metrics may be insufficient for evaluation.
From a social perspective, automated systems may affect the railway maintenance workforce, potentially reducing manual inspection needs. While this could improve worker safety, it raises concerns about job displacement. Public trust in railway safety could also be affected by AI system deployment, with failures potentially undermining confidence in both the technology and railway operators.
\bibliographystyle{plain} % We choose the ``plain'' reference style
\bibliography{refs} % Entries are in the refs.bib file

View File

@ -1,357 +1,474 @@
@misc{avetisyanSceneScriptReconstructingScenes2024,
title = {{{SceneScript}}: {{Reconstructing Scenes With An Autoregressive Structured Language Model}}},
shorttitle = {{{SceneScript}}},
author = {Avetisyan, Armen and Xie, Christopher and {Howard-Jenkins}, Henry and Yang, Tsun-Yi and Aroudj, Samir and Patra, Suvam and Zhang, Fuyang and Frost, Duncan and Holland, Luke and Orme, Campbell and Engel, Jakob and Miller, Edward and Newcombe, Richard and Balntas, Vasileios},
year = {2024},
month = mar,
number = {arXiv:2403.13064},
eprint = {2403.13064},
primaryclass = {cs},
publisher = {arXiv},
doi = {10.48550/arXiv.2403.13064},
urldate = {2025-03-21},
abstract = {We introduce SceneScript, a method that directly produces full scene models as a sequence of structured language commands using an autoregressive, token-based approach. Our proposed scene representation is inspired by recent successes in transformers \& LLMs, and departs from more traditional methods which commonly describe scenes as meshes, voxel grids, point clouds or radiance fields. Our method infers the set of structured language commands directly from encoded visual data using a scene language encoder-decoder architecture. To train SceneScript, we generate and release a large-scale synthetic dataset called Aria Synthetic Environments consisting of 100k high-quality in-door scenes, with photorealistic and ground-truth annotated renders of egocentric scene walkthroughs. Our method gives state-of-the art results in architectural layout estimation, and competitive results in 3D object detection. Lastly, we explore an advantage for SceneScript, which is the ability to readily adapt to new commands via simple additions to the structured language, which we illustrate for tasks such as coarse 3D object part reconstruction.},
archiveprefix = {arXiv},
langid = {american},
keywords = {Computer Science - Computer Vision and Pattern Recognition},
file = {C\:\\Users\\Dustella\\Zotero\\storage\\J4XWBZUJ\\Avetisyan et al. - 2024 - SceneScript Reconstructing Scenes With An Autoregressive Structured Language Model.pdf;C\:\\Users\\Dustella\\Zotero\\storage\\FRUKNIKI\\2403.html}
}
@misc{calvoTimePillarsTemporallyRecurrent3D2023,
title = {{{TimePillars}}: {{Temporally-Recurrent 3D LiDAR Object Detection}}},
shorttitle = {{{TimePillars}}},
author = {Calvo, Ernesto Lozano and Taveira, Bernardo and Kahl, Fredrik and Gustafsson, Niklas and Larsson, Jonathan and Tonderski, Adam},
year = {2023},
month = dec,
number = {arXiv:2312.17260},
eprint = {2312.17260},
primaryclass = {cs},
publisher = {arXiv},
urldate = {2024-10-24},
abstract = {Object detection applied to LiDAR point clouds is a relevant task in robotics, and particularly in autonomous driving. Single frame methods, predominant in the field, exploit information from individual sensor scans. Recent approaches achieve good performance, at relatively low inference time. Nevertheless, given the inherent high sparsity of LiDAR data, these methods struggle in long-range detection (e.g. 200m) which we deem to be critical in achieving safe automation. Aggregating multiple scans not only leads to a denser point cloud representation, but it also brings time-awareness to the system, and provides information about how the environment is changing. Solutions of this kind, however, are often highly problem-specific, demand careful data processing, and tend not to fulfil runtime requirements. In this context we propose TimePillars, a temporally-recurrent object detection pipeline which leverages the pillar representation of LiDAR data across time, respecting hardware integration efficiency constraints, and exploiting the diversity and long-range information of the novel Zenseact Open Dataset (ZOD). Through experimentation, we prove the benefits of having recurrency, and show how basic building blocks are enough to achieve robust and efficient results.},
archiveprefix = {arXiv},
langid = {english},
keywords = {Computer Science - Computer Vision and Pattern Recognition,Computer Science - Machine Learning,Computer Science - Robotics},
file = {C:\Users\Dustella\Zotero\storage\H98SXJK8\Calvo et al. - 2023 - TimePillars Temporally-Recurrent 3D LiDAR Object Detection.pdf}
}
@misc{chenPointGPTAutoregressivelyGenerative2023,
title = {{{PointGPT}}: {{Auto-regressively Generative Pre-training}} from {{Point Clouds}}},
shorttitle = {{{PointGPT}}},
author = {Chen, Guangyan and Wang, Meiling and Yang, Yi and Yu, Kai and Yuan, Li and Yue, Yufeng},
year = {2023},
month = may,
number = {arXiv:2305.11487},
eprint = {2305.11487},
primaryclass = {cs},
publisher = {arXiv},
doi = {10.48550/arXiv.2305.11487},
urldate = {2025-02-28},
abstract = {Large language models (LLMs) based on the generative pre-training transformer (GPT) have demonstrated remarkable effectiveness across a diverse range of downstream tasks. Inspired by the advancements of the GPT, we present PointGPT, a novel approach that extends the concept of GPT to point clouds, addressing the challenges associated with disorder properties, low information density, and task gaps. Specifically, a point cloud auto-regressive generation task is proposed to pre-train transformer models. Our method partitions the input point cloud into multiple point patches and arranges them in an ordered sequence based on their spatial proximity. Then, an extractor-generator based transformer decoder, with a dual masking strategy, learns latent representations conditioned on the preceding point patches, aiming to predict the next one in an auto-regressive manner. Our scalable approach allows for learning high-capacity models that generalize well, achieving state-of-the-art performance on various downstream tasks. In particular, our approach achieves classification accuracies of 94.9\% on the ModelNet40 dataset and 93.4\% on the ScanObjectNN dataset, outperforming all other transformer models. Furthermore, our method also attains new state-of-the-art accuracies on all four few-shot learning benchmarks.},
archiveprefix = {arXiv},
keywords = {Computer Science - Computer Vision and Pattern Recognition},
file = {C\:\\Users\\Dustella\\Zotero\\storage\\PGKQM2FB\\Chen et al. - 2023 - PointGPT Auto-regressively Generative Pre-training from Point Clouds.pdf;C\:\\Users\\Dustella\\Zotero\\storage\\MA6UX2FJ\\2305.html}
}
@misc{daiScanNetRichlyannotated3D2017,
title = {{{ScanNet}}: {{Richly-annotated 3D Reconstructions}} of {{Indoor Scenes}}},
shorttitle = {{{ScanNet}}},
author = {Dai, Angela and Chang, Angel X. and Savva, Manolis and Halber, Maciej and Funkhouser, Thomas and Nie{\ss}ner, Matthias},
year = {2017},
month = apr,
number = {arXiv:1702.04405},
eprint = {1702.04405},
primaryclass = {cs},
publisher = {arXiv},
doi = {10.48550/arXiv.1702.04405},
urldate = {2025-03-27},
abstract = {A key requirement for leveraging supervised deep learning methods is the availability of large, labeled datasets. Unfortunately, in the context of RGB-D scene understanding, very little data is available -- current datasets cover a small range of scene views and have limited semantic annotations. To address this issue, we introduce ScanNet, an RGB-D video dataset containing 2.5M views in 1513 scenes annotated with 3D camera poses, surface reconstructions, and semantic segmentations. To collect this data, we designed an easy-to-use and scalable RGB-D capture system that includes automated surface reconstruction and crowdsourced semantic annotation. We show that using this data helps achieve state-of-the-art performance on several 3D scene understanding tasks, including 3D object classification, semantic voxel labeling, and CAD model retrieval. The dataset is freely available at http://www.scan-net.org.},
archiveprefix = {arXiv},
langid = {american},
keywords = {Computer Science - Computer Vision and Pattern Recognition},
file = {C\:\\Users\\Dustella\\Zotero\\storage\\D4A5TPDK\\Dai et al. - 2017 - ScanNet Richly-annotated 3D Reconstructions of Indoor Scenes.pdf;C\:\\Users\\Dustella\\Zotero\\storage\\4MLTF8R5\\1702.html}
}
@misc{dingLENetLightweightEfficient2023,
title = {{{LENet}}: {{Lightweight And Efficient LiDAR Semantic Segmentation Using Multi-Scale Convolution Attention}}},
shorttitle = {{{LENet}}},
author = {Ding, Ben},
year = {2023},
month = jun,
number = {arXiv:2301.04275},
eprint = {2301.04275},
publisher = {arXiv},
doi = {10.48550/arXiv.2301.04275},
urldate = {2024-11-25},
abstract = {LiDAR-based semantic segmentation is critical in the fields of robotics and autonomous driving as it provides a comprehensive understanding of the scene. This paper proposes a lightweight and efficient projection-based semantic segmentation network called LENet with an encoder-decoder structure for LiDAR-based semantic segmentation. The encoder is composed of a novel multi-scale convolutional attention (MSCA) module with varying receptive field sizes to capture features. The decoder employs an Interpolation And Convolution (IAC) mechanism utilizing bilinear interpolation for upsampling multi-resolution feature maps and integrating previous and current dimensional features through a single convolution layer. This approach significantly reduces the network's complexity while also improving its accuracy. Additionally, we introduce multiple auxiliary segmentation heads to further refine the network's accuracy. Extensive evaluations on publicly available datasets, including SemanticKITTI, SemanticPOSS, and nuScenes, show that our proposed method is lighter, more efficient, and robust compared to state-of-the-art semantic segmentation methods. Full implementation is available at https://github.com/fengluodb/LENet.},
archiveprefix = {arXiv},
keywords = {Computer Science - Computer Vision and Pattern Recognition},
file = {C\:\\Users\\Dustella\\Zotero\\storage\\AITRCJIK\\Ding - 2023 - LENet Lightweight And Efficient LiDAR Semantic Segmentation Using Multi-Scale Convolution Attention.pdf;C\:\\Users\\Dustella\\Zotero\\storage\\SS3LNQJC\\2301.html}
}
@misc{guoDeepLearning3D2020,
title = {Deep {{Learning}} for {{3D Point Clouds}}: {{A Survey}}},
shorttitle = {Deep {{Learning}} for {{3D Point Clouds}}},
author = {Guo, Yulan and Wang, Hanyun and Hu, Qingyong and Liu, Hao and Liu, Li and Bennamoun, Mohammed},
year = {2020},
month = jun,
number = {arXiv:1912.12033},
eprint = {1912.12033},
primaryclass = {cs},
publisher = {arXiv},
urldate = {2024-11-21},
abstract = {Point cloud learning has lately attracted increasing attention due to its wide applications in many areas, such as computer vision, autonomous driving, and robotics. As a dominating technique in AI, deep learning has been successfully used to solve various 2D vision problems. However, deep learning on point clouds is still in its infancy due to the unique challenges faced by the processing of point clouds with deep neural networks. Recently, deep learning on point clouds has become even thriving, with numerous methods being proposed to address different problems in this area. To stimulate future research, this paper presents a comprehensive review of recent progress in deep learning methods for point clouds. It covers three major tasks, including 3D shape classification, 3D object detection and tracking, and 3D point cloud segmentation. It also presents comparative results on several publicly available datasets, together with insightful observations and inspiring future research directions.},
archiveprefix = {arXiv},
langid = {english},
keywords = {Computer Science - Computer Vision and Pattern Recognition,Computer Science - Machine Learning,Computer Science - Robotics,Electrical Engineering and Systems Science - Image and Video Processing},
file = {C:\Users\Dustella\Zotero\storage\CSB2ZYNP\Guo et al. - 2020 - Deep Learning for 3D Point Clouds A Survey.pdf}
}
@misc{huRandLANetEfficientSemantic2020,
title = {{{RandLA-Net}}: {{Efficient Semantic Segmentation}} of {{Large-Scale Point Clouds}}},
shorttitle = {{{RandLA-Net}}},
author = {Hu, Qingyong and Yang, Bo and Xie, Linhai and Rosa, Stefano and Guo, Yulan and Wang, Zhihua and Trigoni, Niki and Markham, Andrew},
year = {2020},
month = may,
number = {arXiv:1911.11236},
eprint = {1911.11236},
primaryclass = {cs},
publisher = {arXiv},
urldate = {2024-11-11},
abstract = {We study the problem of efficient semantic segmentation for large-scale 3D point clouds. By relying on expensive sampling techniques or computationally heavy pre/postprocessing steps, most existing approaches are only able to be trained and operate over small-scale point clouds. In this paper, we introduce RandLA-Net, an efficient and lightweight neural architecture to directly infer per-point semantics for large-scale point clouds. The key to our approach is to use random point sampling instead of more complex point selection approaches. Although remarkably computation and memory efficient, random sampling can discard key features by chance. To overcome this, we introduce a novel local feature aggregation module to progressively increase the receptive field for each 3D point, thereby effectively preserving geometric details. Extensive experiments show that our RandLA-Net can process 1 million points in a single pass with up to 200{\texttimes} faster than existing approaches. Moreover, our RandLA-Net clearly surpasses state-of-the-art approaches for semantic segmentation on two large-scale benchmarks Semantic3D and SemanticKITTI.},
archiveprefix = {arXiv},
langid = {english},
keywords = {Computer Science - Computer Vision and Pattern Recognition,Computer Science - Machine Learning,Electrical Engineering and Systems Science - Image and Video Processing},
file = {C:\Users\Dustella\Zotero\storage\ZSDB4DWZ\Hu et al. - 2020 - RandLA-Net Efficient Semantic Segmentation of Large-Scale Point Clouds.pdf}
}
@misc{laiSphericalTransformerLiDARbased2023,
title = {Spherical {{Transformer}} for {{LiDAR-based 3D Recognition}}},
author = {Lai, Xin and Chen, Yukang and Lu, Fanbin and Liu, Jianhui and Jia, Jiaya},
year = {2023},
month = mar,
number = {arXiv:2303.12766},
eprint = {2303.12766},
primaryclass = {cs},
publisher = {arXiv},
urldate = {2024-11-18},
abstract = {LiDAR-based 3D point cloud recognition has benefited various applications. Without specially considering the LiDAR point distribution, most current methods suffer from information disconnection and limited receptive field, especially for the sparse distant points. In this work, we study the varying-sparsity distribution of LiDAR points and present SphereFormer to directly aggregate information from dense close points to the sparse distant ones. We design radial window self-attention that partitions the space into multiple non-overlapping narrow and long windows. It overcomes the disconnection issue and enlarges the receptive field smoothly and dramatically, which significantly boosts the performance of sparse distant points. Moreover, to fit the narrow and long windows, we propose exponential splitting to yield fine-grained position encoding and dynamic feature selection to increase model representation ability. Notably, our method ranks 1st on both nuScenes and SemanticKITTI semantic segmentation benchmarks with 81.9\% and 74.8\% mIoU, respectively. Also, we achieve the 3rd place on nuScenes object detection benchmark with 72.8\% NDS and 68.5\% mAP. Code is available at https: // github.com/ dvlab-research/ SphereFormer.git.},
archiveprefix = {arXiv},
langid = {english},
keywords = {Computer Science - Artificial Intelligence,Computer Science - Computer Vision and Pattern Recognition},
file = {C:\Users\Dustella\Zotero\storage\NNCDXGL9\Lai et al. - 2023 - Spherical Transformer for LiDAR-based 3D Recognition.pdf}
}
@misc{liRAPiDSegRangeAwarePointwise2024,
title = {{{RAPiD-Seg}}: {{Range-Aware Pointwise Distance Distribution Networks}} for {{3D LiDAR Segmentation}}},
shorttitle = {{{RAPiD-Seg}}},
author = {Li, Li and Shum, Hubert P. H. and Breckon, Toby P.},
year = {2024},
month = sep,
number = {arXiv:2407.10159},
eprint = {2407.10159},
primaryclass = {cs},
publisher = {arXiv},
urldate = {2024-10-24},
abstract = {3D point clouds play a pivotal role in outdoor scene perception, especially in the context of autonomous driving. Recent advancements in 3D LiDAR segmentation often focus intensely on the spatial positioning and distribution of points for accurate segmentation. However, these methods, while robust in variable conditions, encounter challenges due to sole reliance on coordinates and point intensity, leading to poor isometric invariance and suboptimal segmentation. To tackle this challenge, our work introduces Range-Aware Pointwise Distance Distribution (RAPiD) features and the associated RAPiD-Seg architecture. Our RAPiD features exhibit rigid transformation invariance and effectively adapt to variations in point density, with a design focus on capturing the localized geometry of neighboring structures. They utilize inherent LiDAR isotropic radiation and semantic categorization for enhanced local representation and computational efficiency, while incorporating a 4D distance metric that integrates geometric and surface material reflectivity for improved semantic segmentation. To effectively embed high-dimensional RAPiD features, we propose a double-nested autoencoder structure with a novel class-aware embedding objective to encode high-dimensional features into manageable voxel-wise embeddings. Additionally, we propose RAPiD-Seg which incorporates a channel-wise attention fusion and two effective RAPiD-Seg variants, further optimizing the embedding for enhanced performance and generalization. Our method outperforms contemporary LiDAR segmentation work in terms of mIoU on SemanticKITTI (76.1) and nuScenes (83.6) datasets.},
archiveprefix = {arXiv},
langid = {english},
keywords = {Computer Science - Computer Vision and Pattern Recognition,Computer Science - Machine Learning,Computer Science - Robotics},
file = {C:\Users\Dustella\Zotero\storage\NB5XSCWU\Li et al. - 2024 - RAPiD-Seg Range-Aware Pointwise Distance Distribution Networks for 3D LiDAR Segmentation.pdf}
}
@misc{PapersCodePointNet,
title = {Papers with {{Code}} - {{PointNet}}: {{Deep Learning}} on {{Point Sets}} for {{3D Classification}} and {{Segmentation}}},
shorttitle = {Papers with {{Code}} - {{PointNet}}},
urldate = {2024-10-25},
abstract = {🏆 SOTA for Semantic Segmentation on S3DIS (Number of params metric)},
howpublished = {https://paperswithcode.com/paper/pointnet-deep-learning-on-point-sets-for-3d},
langid = {english},
file = {C:\Users\Dustella\Zotero\storage\GTJ8KDKL\pointnet-deep-learning-on-point-sets-for-3d.html}
}
@misc{parkRethinkingDataAugmentation2024,
title = {Rethinking {{Data Augmentation}} for {{Robust LiDAR Semantic Segmentation}} in {{Adverse Weather}}},
author = {Park, Junsung and Kim, Kyungmin and Shim, Hyunjung},
year = {2024},
month = jul,
number = {arXiv:2407.02286},
eprint = {2407.02286},
primaryclass = {cs},
publisher = {arXiv},
urldate = {2024-10-24},
abstract = {Existing LiDAR semantic segmentation methods often struggle with performance declines in adverse weather conditions. Previous research has addressed this issue by simulating adverse weather or employing universal data augmentation during training. However, these methods lack a detailed analysis and understanding of how adverse weather negatively affects LiDAR semantic segmentation performance. Motivated by this issue, we identified key factors of adverse weather and conducted a toy experiment to pinpoint the main causes of performance degradation: (1) Geometric perturbation due to refraction caused by fog or droplets in the air and (2) Point drop due to energy absorption and occlusions. Based on these findings, we propose new strategic data augmentation techniques. First, we introduced a Selective Jittering (SJ) that jitters points in the random range of depth (or angle) to mimic geometric perturbation. Additionally, we developed a Learnable Point Drop (LPD) to learn vulnerable erase patterns with Deep Q-Learning Network to approximate the point drop phenomenon from adverse weather conditions. Without precise weather simulation, these techniques strengthen the LiDAR semantic segmentation model by exposing it to vulnerable conditions identified by our data-centric analysis. Experimental results confirmed the suitability of the proposed data augmentation methods for enhancing robustness against adverse weather conditions. Our method attains a remarkable 39.5 mIoU on the SemanticKITTI-to-SemanticSTF benchmark, surpassing the previous state-of-the-art by over 5.4\%p, tripling the improvement over the baseline compared to previous methods achieved.},
archiveprefix = {arXiv},
langid = {english},
keywords = {Computer Science - Artificial Intelligence,Computer Science - Computer Vision and Pattern Recognition},
file = {C:\Users\Dustella\Zotero\storage\GLAUMJCL\Park et al. - 2024 - Rethinking Data Augmentation for Robust LiDAR Semantic Segmentation in Adverse Weather.pdf}
}
@misc{qiPointNetDeepLearning2017,
title = {{{PointNet}}: {{Deep Learning}} on {{Point Sets}} for {{3D Classification}} and {{Segmentation}}},
shorttitle = {{{PointNet}}},
author = {Qi, Charles R. and Su, Hao and Mo, Kaichun and Guibas, Leonidas J.},
year = {2017},
month = apr,
number = {arXiv:1612.00593},
eprint = {1612.00593},
primaryclass = {cs},
publisher = {arXiv},
urldate = {2024-10-25},
abstract = {Point cloud is an important type of geometric data structure. Due to its irregular format, most researchers transform such data to regular 3D voxel grids or collections of images. This, however, renders data unnecessarily voluminous and causes issues. In this paper, we design a novel type of neural network that directly consumes point clouds, which well respects the permutation invariance of points in the input. Our network, named PointNet, provides a unified architecture for applications ranging from object classification, part segmentation, to scene semantic parsing. Though simple, PointNet is highly efficient and effective. Empirically, it shows strong performance on par or even better than state of the art. Theoretically, we provide analysis towards understanding of what the network has learnt and why the network is robust with respect to input perturbation and corruption.},
archiveprefix = {arXiv},
langid = {english},
keywords = {Computer Science - Computer Vision and Pattern Recognition},
file = {C:\Users\Dustella\Zotero\storage\SEMCPFVH\Qi et al. - 2017 - PointNet Deep Learning on Point Sets for 3D Classification and Segmentation.pdf}
}
@misc{schmidtLiDARViewSynthesis2023,
title = {{{LiDAR View Synthesis}} for {{Robust Vehicle Navigation Without Expert Labels}}},
author = {Schmidt, Jonathan and Khan, Qadeer and Cremers, Daniel},
year = {2023},
month = aug,
number = {arXiv:2308.01424},
eprint = {2308.01424},
primaryclass = {cs},
publisher = {arXiv},
urldate = {2024-10-24},
abstract = {Deep learning models for self-driving cars require a diverse training dataset to manage critical driving scenarios on public roads safely. This includes having data from divergent trajectories, such as the oncoming traffic lane or sidewalks. Such data would be too dangerous to collect in the real world. Data augmentation approaches have been proposed to tackle this issue using RGB images. However, solutions based on LiDAR sensors are scarce. Therefore, we propose synthesizing additional LiDAR point clouds from novel viewpoints without physically driving at dangerous positions. The LiDAR view synthesis is done using mesh reconstruction and ray casting. We train a deep learning model, which takes a LiDAR scan as input and predicts the future trajectory as output. A waypoint controller is then applied to this predicted trajectory to determine the throttle and steering labels of the ego-vehicle. Our method neither requires expert driving labels for the original nor the synthesized LiDAR sequence. Instead, we infer labels from LiDAR odometry. We demonstrate the effectiveness of our approach in a comprehensive online evaluation and with a comparison to concurrent work. Our results show the importance of synthesizing additional LiDAR point clouds, particularly in terms of model robustness. Project page: https://jonathsch.github.io/lidar-synthesis/},
archiveprefix = {arXiv},
langid = {english},
keywords = {Computer Science - Computer Vision and Pattern Recognition},
file = {C:\Users\Dustella\Zotero\storage\YTVNTFNG\Schmidt et al. - 2023 - LiDAR View Synthesis for Robust Vehicle Navigation Without Expert Labels.pdf}
}
@misc{thomasKPConvFlexibleDeformable2019,
title = {{{KPConv}}: {{Flexible}} and {{Deformable Convolution}} for {{Point Clouds}}},
shorttitle = {{{KPConv}}},
author = {Thomas, Hugues and Qi, Charles R. and Deschaud, Jean-Emmanuel and Marcotegui, Beatriz and Goulette, Fran{\c c}ois and Guibas, Leonidas J.},
year = {2019},
month = aug,
number = {arXiv:1904.08889},
eprint = {1904.08889},
publisher = {arXiv},
doi = {10.48550/arXiv.1904.08889},
urldate = {2024-11-11},
abstract = {We present Kernel Point Convolution (KPConv), a new design of point convolution, i.e. that operates on point clouds without any intermediate representation. The convolution weights of KPConv are located in Euclidean space by kernel points, and applied to the input points close to them. Its capacity to use any number of kernel points gives KPConv more flexibility than fixed grid convolutions. Furthermore, these locations are continuous in space and can be learned by the network. Therefore, KPConv can be extended to deformable convolutions that learn to adapt kernel points to local geometry. Thanks to a regular subsampling strategy, KPConv is also efficient and robust to varying densities. Whether they use deformable KPConv for complex tasks, or rigid KPconv for simpler tasks, our networks outperform state-of-the-art classification and segmentation approaches on several datasets. We also offer ablation studies and visualizations to provide understanding of what has been learned by KPConv and to validate the descriptive power of deformable KPConv.},
archiveprefix = {arXiv},
keywords = {Computer Science - Computer Vision and Pattern Recognition},
file = {C\:\\Users\\Dustella\\Zotero\\storage\\GFTWYHLA\\Thomas et al. - 2019 - KPConv Flexible and Deformable Convolution for Point Clouds.pdf;C\:\\Users\\Dustella\\Zotero\\storage\\99L2NSWF\\1904.html}
}
@misc{wangSegNet4DEffectiveEfficient2024,
title = {{{SegNet4D}}: {{Effective}} and {{Efficient 4D LiDAR Semantic Segmentation}} in {{Autonomous Driving Environments}}},
shorttitle = {{{SegNet4D}}},
author = {Wang, Neng and Guo, Ruibin and Shi, Chenghao and Zhang, Hui and Lu, Huimin and Zheng, Zhiqiang and Chen, Xieyuanli},
year = {2024},
month = jun,
number = {arXiv:2406.16279},
eprint = {2406.16279},
primaryclass = {cs},
publisher = {arXiv},
urldate = {2024-10-24},
abstract = {4D LiDAR semantic segmentation, also referred to as multi-scan semantic segmentation, plays a crucial role in enhancing the environmental understanding capabilities of autonomous vehicles. It entails identifying the semantic category of each point in the LiDAR scan and distinguishing whether it is dynamic, a critical aspect in downstream tasks such as path planning and autonomous navigation. Existing methods for 4D semantic segmentation often rely on computationally intensive 4D convolutions for multi-scan input, resulting in poor real-time performance. In this article, we introduce SegNet4D, a novel real-time multi-scan semantic segmentation method leveraging a projection-based approach for fast motion feature encoding, showcasing outstanding performance. SegNet4D treats 4D semantic segmentation as two distinct tasks: single-scan semantic segmentation and moving object segmentation, each addressed by dedicated head. These results are then fused in the proposed motion-semantic fusion module to achieve comprehensive multi-scan semantic segmentation. Besides, we propose extracting instance information from the current scan and incorporating it into the network for instance-aware segmentation. Our approach exhibits state-of-the-art performance across multiple datasets and stands out as a real-time multi-scan semantic segmentation method. The implementation of SegNet4D will be made available at {\textbackslash}url\{https://github.com/nubot-nudt/SegNet4D\}.},
archiveprefix = {arXiv},
langid = {english},
keywords = {Computer Science - Computer Vision and Pattern Recognition},
file = {C:\Users\Dustella\Zotero\storage\W87LWRFV\Wang et al. - 2024 - SegNet4D Effective and Efficient 4D LiDAR Semantic Segmentation in Autonomous Driving Environments.pdf}
}
@misc{wuPointTransformerV22022,
title = {Point {{Transformer V2}}: {{Grouped Vector Attention}} and {{Partition-based Pooling}}},
shorttitle = {Point {{Transformer V2}}},
author = {Wu, Xiaoyang and Lao, Yixing and Jiang, Li and Liu, Xihui and Zhao, Hengshuang},
year = {2022},
month = oct,
number = {arXiv:2210.05666},
eprint = {2210.05666},
primaryclass = {cs},
publisher = {arXiv},
doi = {10.48550/arXiv.2210.05666},
urldate = {2025-03-04},
abstract = {As a pioneering work exploring transformer architecture for 3D point cloud understanding, Point Transformer achieves impressive results on multiple highly competitive benchmarks. In this work, we analyze the limitations of the Point Transformer and propose our powerful and efficient Point Transformer V2 model with novel designs that overcome the limitations of previous work. In particular, we first propose group vector attention, which is more effective than the previous version of vector attention. Inheriting the advantages of both learnable weight encoding and multi-head attention, we present a highly effective implementation of grouped vector attention with a novel grouped weight encoding layer. We also strengthen the position information for attention by an additional position encoding multiplier. Furthermore, we design novel and lightweight partition-based pooling methods which enable better spatial alignment and more efficient sampling. Extensive experiments show that our model achieves better performance than its predecessor and achieves state-of-the-art on several challenging 3D point cloud understanding benchmarks, including 3D point cloud segmentation on ScanNet v2 and S3DIS and 3D point cloud classification on ModelNet40. Our code will be available at https://github.com/Gofinge/PointTransformerV2.},
archiveprefix = {arXiv},
keywords = {Computer Science - Computer Vision and Pattern Recognition},
file = {C:\Users\Dustella\Zotero\storage\4MVHL7TR\Wu et al. - 2022 - Point Transformer V2 Grouped Vector Attention and Partition-based Pooling.pdf}
}
@misc{wuPointTransformerV32024,
title = {Point {{Transformer V3}}: {{Simpler}}, {{Faster}}, {{Stronger}}},
shorttitle = {Point {{Transformer V3}}},
author = {Wu, Xiaoyang and Jiang, Li and Wang, Peng-Shuai and Liu, Zhijian and Liu, Xihui and Qiao, Yu and Ouyang, Wanli and He, Tong and Zhao, Hengshuang},
year = {2024},
month = mar,
number = {arXiv:2312.10035},
eprint = {2312.10035},
publisher = {arXiv},
doi = {10.48550/arXiv.2312.10035},
urldate = {2024-11-25},
abstract = {This paper is not motivated to seek innovation within the attention mechanism. Instead, it focuses on overcoming the existing trade-offs between accuracy and efficiency within the context of point cloud processing, leveraging the power of scale. Drawing inspiration from recent advances in 3D large-scale representation learning, we recognize that model performance is more influenced by scale than by intricate design. Therefore, we present Point Transformer V3 (PTv3), which prioritizes simplicity and efficiency over the accuracy of certain mechanisms that are minor to the overall performance after scaling, such as replacing the precise neighbor search by KNN with an efficient serialized neighbor mapping of point clouds organized with specific patterns. This principle enables significant scaling, expanding the receptive field from 16 to 1024 points while remaining efficient (a 3x increase in processing speed and a 10x improvement in memory efficiency compared with its predecessor, PTv2). PTv3 attains state-of-the-art results on over 20 downstream tasks that span both indoor and outdoor scenarios. Further enhanced with multi-dataset joint training, PTv3 pushes these results to a higher level.},
archiveprefix = {arXiv},
keywords = {Computer Science - Computer Vision and Pattern Recognition},
file = {C\:\\Users\\Dustella\\Zotero\\storage\\XYSQFHA7\\Wu et al. - 2024 - Point Transformer V3 Simpler, Faster, Stronger.pdf;C\:\\Users\\Dustella\\Zotero\\storage\\FPH7SMHI\\2312.html}
}
@misc{xieAnnotatorGenericActive2023,
title = {Annotator: {{A Generic Active Learning Baseline}} for {{LiDAR Semantic Segmentation}}},
shorttitle = {Annotator},
author = {Xie, Binhui and Li, Shuang and Guo, Qingju and Liu, Chi Harold and Cheng, Xinjing},
year = {2023},
month = oct,
number = {arXiv:2310.20293},
eprint = {2310.20293},
primaryclass = {cs},
publisher = {arXiv},
urldate = {2024-10-24},
abstract = {Active learning, a label-efficient paradigm, empowers models to interactively query an oracle for labeling new data. In the realm of LiDAR semantic segmentation, the challenges stem from the sheer volume of point clouds, rendering annotation labor-intensive and cost-prohibitive. This paper presents Annotator, a general and efficient active learning baseline, in which a voxel-centric online selection strategy is tailored to efficiently probe and annotate the salient and exemplar voxel girds within each LiDAR scan, even under distribution shift. Concretely, we first execute an in-depth analysis of several common selection strategies such as Random, Entropy, Margin, and then develop voxel confusion degree (VCD) to exploit the local topology relations and structures of point clouds. Annotator excels in diverse settings, with a particular focus on active learning (AL), active source-free domain adaptation (ASFDA), and active domain adaptation (ADA). It consistently delivers exceptional performance across LiDAR semantic segmentation benchmarks, spanning both simulation-to-real and real-to-real scenarios. Surprisingly, Annotator exhibits remarkable efficiency, requiring significantly fewer annotations, e.g., just labeling five voxels per scan in the SynLiDAR {$\rightarrow$} SemanticKITTI task. This results in impressive performance, achieving 87.8\% fully-supervised performance under AL, 88.5\% under ASFDA, and 94.4\% under ADA. We envision that Annotator will offer a simple, general, and efficient solution for label-efficient 3D applications.},
archiveprefix = {arXiv},
langid = {english},
keywords = {Computer Science - Computer Vision and Pattern Recognition},
file = {C:\Users\Dustella\Zotero\storage\R93MHYZQ\Xie et al. - 2023 - Annotator A Generic Active Learning Baseline for LiDAR Semantic Segmentation.pdf}
}
@article{zengSelfsupervisedLearningPoint2024,
title = {Self-Supervised Learning for Point Cloud Data: {{A}} Survey},
shorttitle = {Self-Supervised Learning for Point Cloud Data},
author = {Zeng, Changyu and Wang, Wei and Nguyen, Anh and Xiao, Jimin and Yue, Yutao},
year = {2024},
month = mar,
journal = {Expert Systems with Applications},
volume = {237},
pages = {121354},
issn = {0957-4174},
doi = {10.1016/j.eswa.2023.121354},
urldate = {2024-11-25},
abstract = {3D point clouds are a crucial type of data collected by LiDAR sensors and widely used in transportation applications due to its concise descriptions and accurate localization. Deep neural networks (DNNs) have achieved remarkable success in processing large amount of disordered and sparse 3D point clouds, especially in various computer vision tasks, such as pedestrian detection and vehicle recognition. Among all the learning paradigms, Self-Supervised Learning (SSL), an unsupervised training paradigm that mines effective information from the data itself, is considered as an essential solution to solve the time-consuming and labor-intensive data labeling problems via smart pre-training task design. This paper provides a comprehensive survey of recent advances on SSL for point clouds. We first present an innovative taxonomy, categorizing the existing SSL methods into four broad categories based on the pretexts' characteristics. Under each category, we then further categorize the methods into more fine-grained groups and summarize the strength and limitations of the representative methods. We also compare the performance of the notable SSL methods in literature on multiple downstream tasks on benchmark datasets both quantitatively and qualitatively. Finally, we propose a number of future research directions based on the identified limitations of existing SSL research on point clouds.},
keywords = {Computer vision,Point clouds,Pretext task,Representation learning,Self-supervised learning,Transfer learning},
file = {C\:\\Users\\Dustella\\Zotero\\storage\\UEQK3BPY\\Zeng et al. - 2024 - Self-supervised learning for point cloud data A survey.pdf;C\:\\Users\\Dustella\\Zotero\\storage\\RXGEIFDA\\S0957417423018560.html}
}
@misc{zhangApproachingOutsideScaling2024,
title = {Approaching {{Outside}}: {{Scaling Unsupervised 3D Object Detection}} from {{2D Scene}}},
shorttitle = {Approaching {{Outside}}},
author = {Zhang, Ruiyang and Zhang, Hu and Yu, Hang and Zheng, Zhedong},
year = {2024},
month = jul,
number = {arXiv:2407.08569},
eprint = {2407.08569},
primaryclass = {cs},
publisher = {arXiv},
urldate = {2024-10-24},
abstract = {The unsupervised 3D object detection is to accurately detect objects in unstructured environments with no explicit supervisory signals. This task, given sparse LiDAR point clouds, often results in compromised performance for detecting distant or small objects due to the inherent sparsity and limited spatial resolution. In this paper, we are among the early attempts to integrate LiDAR data with 2D images for unsupervised 3D detection and introduce a new method, dubbed LiDAR2D Self-paced Learning (LiSe). We argue that RGB images serve as a valuable complement to LiDAR data, offering precise 2D localization cues, particularly when scarce LiDAR points are available for certain objects. Considering the unique characteristics of both modalities, our framework devises a self-paced learning pipeline that incorporates adaptive sampling and weak model aggregation strategies. The adaptive sampling strategy dynamically tunes the distribution of pseudo labels during training, countering the tendency of models to overfit easily detected samples, such as nearby and large-sized objects. By doing so, it ensures a balanced learning trajectory across varying object scales and distances. The weak model aggregation component consolidates the strengths of models trained under different pseudo label distributions, culminating in a robust and powerful final model. Experimental evaluations validate the efficacy of our proposed LiSe method, manifesting significant improvements of +7.1\% APBEV and +3.4\% AP3D on nuScenes, and +8.3\% APBEV and +7.4\% AP3D on Lyft compared to existing techniques.},
archiveprefix = {arXiv},
langid = {english},
keywords = {Computer Science - Computer Vision and Pattern Recognition},
file = {C:\Users\Dustella\Zotero\storage\V4KC9KL6\Zhang et al. - 2024 - Approaching Outside Scaling Unsupervised 3D Object Detection from 2D Scene.pdf}
}
@article{zhangDeepLearningbased3D2023,
title = {Deep Learning-Based {{3D}} Point Cloud Classification: {{A}} Systematic Survey and Outlook},
shorttitle = {Deep Learning-Based {{3D}} Point Cloud Classification},
author = {Zhang, Huang and Wang, Changshuo and Tian, Shengwei and Lu, Baoli and Zhang, Liping and Ning, Xin and Bai, Xiao},
year = {2023},
month = sep,
journal = {Displays},
volume = {79},
pages = {102456},
issn = {0141-9382},
doi = {10.1016/j.displa.2023.102456},
urldate = {2024-11-25},
abstract = {In recent years, point cloud representation has become one of the research hotspots in the field of computer vision, and has been widely used in many fields, such as autonomous driving, virtual reality, robotics, etc. Although deep learning techniques have achieved great success in processing regular structured 2D grid image data, there are still great challenges in processing irregular, unstructured point cloud data. Point cloud classification is the basis of point cloud analysis, and many deep learning-based methods have been widely used in this task. Therefore, the purpose of this paper is to provide researchers in this field with the latest research progress and future trends. First, we introduce point cloud acquisition, characteristics, and challenges. Second, we review 3D data representations, storage formats, and commonly used datasets for point cloud classification. We then summarize deep learning-based methods for point cloud classification and complement recent research work. Next, we compare and analyze the performance of the main methods. Finally, we discuss some challenges and future directions for point cloud classification.},
langid = {american},
keywords = {3D data,Classification,Deep learning,Point cloud},
file = {C\:\\Users\\Dustella\\Zotero\\storage\\FCMEI7NV\\Zhang et al. - 2023 - Deep learning-based 3D point cloud classification A systematic survey and outlook.pdf;C\:\\Users\\Dustella\\Zotero\\storage\\9427LLCF\\S0141938223000896.html}
}
@misc{zhangDetectingAnomaliesLiDAR2023,
title = {Detecting the {{Anomalies}} in {{LiDAR Pointcloud}}},
author = {Zhang, Chiyu and Han, Ji and Zou, Yao and Dong, Kexin and Li, Yujia and Ding, Junchun and Han, Xiaoling},
year = {2023},
month = jul,
number = {arXiv:2308.00187},
eprint = {2308.00187},
primaryclass = {cs},
publisher = {arXiv},
urldate = {2024-10-24},
abstract = {LiDAR sensors play an important role in the perception stack of modern autonomous driving systems. Adverse weather conditions such as rain, fog and dust, as well as some (occasional) LiDAR hardware fault may cause the LiDAR to produce pointcloud with abnormal patterns such as scattered noise points and uncommon intensity values. In this paper, we propose a novel approach to detect whether a LiDAR is generating anomalous pointcloud by analyzing the pointcloud characteristics. Specifically, we develop a pointcloud quality metric based on the LiDAR points' spatial and intensity distribution to characterize the noise level of the pointcloud, which relies on pure mathematical analysis and does not require any labeling or training as learning-based methods do. Therefore, the method is scalable and can be quickly deployed either online to improve the autonomy safety by monitoring anomalies in the LiDAR data or offline to perform in-depth study of the LiDAR behavior over large amount of data. The proposed approach is studied with extensive real public road data collected by LiDARs with different scanning mechanisms and laser spectrums, and is proven to be able to effectively handle various known and unknown sources of pointcloud anomaly.},
archiveprefix = {arXiv},
langid = {english},
keywords = {Computer Science - Computer Vision and Pattern Recognition,Computer Science - Robotics,Electrical Engineering and Systems Science - Signal Processing},
file = {C:\Users\Dustella\Zotero\storage\9VXKS8XI\Zhang et al. - 2023 - Detecting the Anomalies in LiDAR Pointcloud.pdf}
}
@article{grun2016taxonomy,
title={A taxonomy and library for visualizing learned features in convolutional neural networks},
author={Gr{\"u}n, Felix and Rupprecht, Christian and Navab, Nassir and Tombari, Federico},
journal={arXiv preprint arXiv:1606.07757},
year={2016}
}
@misc{zhangGSMatchingReconsideringFeature2024,
title = {{{GS-Matching}}: {{Reconsidering Feature Matching}} Task in {{Point Cloud Registration}}},
shorttitle = {{{GS-Matching}}},
author = {Zhang, Yaojie and Huang, Tianlun and Wang, Weijun and Feng, Wei},
year = {2024},
month = dec,
number = {arXiv:2412.04855},
eprint = {2412.04855},
primaryclass = {cs},
publisher = {arXiv},
doi = {10.48550/arXiv.2412.04855},
urldate = {2024-12-18},
abstract = {Traditional point cloud registration (PCR) methods for feature matching often employ the nearest neighbor policy. This leads to many-to-one matches and numerous potential inliers without any corresponding point. Recently, some approaches have framed the feature matching task as an assignment problem to achieve optimal one-to-one matches. We argue that the transition to the Assignment problem is not reliable for general correspondence-based PCR. In this paper, we propose a heuristics stable matching policy called GS-matching, inspired by the Gale-Shapley algorithm. Compared to the other matching policies, our method can perform efficiently and find more non-repetitive inliers under low overlapping conditions. Furthermore, we employ the probability theory to analyze the feature matching task, providing new insights into this research problem. Extensive experiments validate the effectiveness of our matching policy, achieving better registration recall on multiple datasets.},
archiveprefix = {arXiv},
langid = {english},
keywords = {Computer Science - Computer Vision and Pattern Recognition},
file = {C:\Users\Dustella\Zotero\storage\6KUHQ38Q\Zhang et al. - 2024 - GS-Matching Reconsidering Feature Matching task in Point Cloud Registration.pdf}
}
@article{wongsuphasawat2017visualizing,
title={Visualizing dataflow graphs of deep learning models in tensorflow},
author={Wongsuphasawat, Kanit and Smilkov, Daniel and Wexler, James and Wilson, Jimbo and Mane, Dandelion and Fritz, Doug and Krishnan, Dilip and Vi{\'e}gas, Fernanda B and Wattenberg, Martin},
journal={IEEE transactions on visualization and computer graphics},
volume={24},
number={1},
pages={1--12},
year={2017},
publisher={IEEE}
}
@inproceedings{zhangPolarNetImprovedGrid2020,
title = {{{PolarNet}}: {{An Improved Grid Representation}} for {{Online LiDAR Point Clouds Semantic Segmentation}}},
shorttitle = {{{PolarNet}}},
booktitle = {2020 {{IEEE}}/{{CVF Conference}} on {{Computer Vision}} and {{Pattern Recognition}} ({{CVPR}})},
author = {Zhang, Yang and Zhou, Zixiang and David, Philip and Yue, Xiangyu and Xi, Zerong and Gong, Boqing and Foroosh, Hassan},
year = {2020},
month = jun,
pages = {9598--9607},
publisher = {IEEE},
address = {Seattle, WA, USA},
doi = {10.1109/CVPR42600.2020.00962},
urldate = {2024-10-25},
abstract = {The need for fine-grained perception in autonomous driving systems has resulted in recently increased research on online semantic segmentation of single-scan LiDAR. Despite the emerging datasets and technological advancements, it remains challenging due to three reasons: (1) the need for near-real-time latency with limited hardware; (2) uneven or even long-tailed distribution of LiDAR points across space; and (3) an increasing number of extremely fine-grained semantic classes. In an attempt to jointly tackle all the aforementioned challenges, we propose a new LiDAR-specific, nearest-neighbor-free segmentation algorithm --- PolarNet. Instead of using common spherical or bird's-eye-view projection, our polar bird's-eye-view representation balances the points across grid cells in a polar coordinate system, indirectly aligning a segmentation network's attention with the long-tailed distribution of the points along the radial axis. We find that our encoding scheme greatly increases the mIoU in three drastically different segmentation datasets of real urban LiDAR single scans while retaining near real-time throughput.},
copyright = {https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html},
isbn = {978-1-7281-7168-5},
langid = {english},
file = {C:\Users\Dustella\Zotero\storage\M6KA53MU\Zhang et al. - 2020 - PolarNet An Improved Grid Representation for Online LiDAR Point Clouds Semantic Segmentation.pdf}
}
@inproceedings{nguyen2000visualization,
title={A visualization tool for interactive learning of large decision trees},
author={Nguyen, Trong Dung and Ho, Tu Bao and Shimodaira, Hiroshi},
booktitle={Proceedings 12th IEEE International Conference on Tools with Artificial Intelligence. ICTAI 2000},
pages={28--35},
year={2000},
organization={IEEE}
}
@misc{zhangStreetGaussians3D2024,
title = {Street {{Gaussians}} without {{3D Object Tracker}}},
author = {Zhang, Ruida and Li, Chengxi and Zhang, Chenyangguang and Liu, Xingyu and Yuan, Haili and Li, Yanyan and Ji, Xiangyang and Lee, Gim Hee},
year = {2024},
month = dec,
number = {arXiv:2412.05548},
eprint = {2412.05548},
primaryclass = {cs},
publisher = {arXiv},
doi = {10.48550/arXiv.2412.05548},
urldate = {2024-12-18},
abstract = {Realistic scene reconstruction in driving scenarios poses significant challenges due to fast-moving objects. Most existing methods rely on labor-intensive manual labeling of object poses to reconstruct dynamic objects in canonical space and move them based on these poses during rendering. While some approaches attempt to use 3D object trackers to replace manual annotations, the limited generalization of 3D trackers -- caused by the scarcity of large-scale 3D datasets -- results in inferior reconstructions in real-world settings. In contrast, 2D foundation models demonstrate strong generalization capabilities. To eliminate the reliance on 3D trackers and enhance robustness across diverse environments, we propose a stable object tracking module by leveraging associations from 2D deep trackers within a 3D object fusion strategy. We address inevitable tracking errors by further introducing a motion learning strategy in an implicit feature space that autonomously corrects trajectory errors and recovers missed detections. Experimental results on Waymo-NOTR datasets show we achieve state-of-the-art performance. Our code will be made publicly available.},
archiveprefix = {arXiv},
langid = {english},
keywords = {Computer Science - Computer Vision and Pattern Recognition},
file = {C:\Users\Dustella\Zotero\storage\SZ3XTHZS\Zhang et al. - 2024 - Street Gaussians without 3D Object Tracker.pdf}
}
@inproceedings{caruana2006empirical,
title={An empirical comparison of supervised learning algorithms},
author={Caruana, Rich and Niculescu-Mizil, Alexandru},
booktitle={Proceedings of the 23rd international conference on Machine learning},
pages={161--168},
year={2006}
}
@misc{zhaoPointTransformer2021,
title = {Point {{Transformer}}},
author = {Zhao, Hengshuang and Jiang, Li and Jia, Jiaya and Torr, Philip and Koltun, Vladlen},
year = {2021},
month = sep,
number = {arXiv:2012.09164},
eprint = {2012.09164},
primaryclass = {cs},
publisher = {arXiv},
doi = {10.48550/arXiv.2012.09164},
urldate = {2025-01-06},
abstract = {Self-attention networks have revolutionized natural language processing and are making impressive strides in image analysis tasks such as image classification and object detection. Inspired by this success, we investigate the application of self-attention networks to 3D point cloud processing. We design self-attention layers for point clouds and use these to construct self-attention networks for tasks such as semantic scene segmentation, object part segmentation, and object classification. Our Point Transformer design improves upon prior work across domains and tasks. For example, on the challenging S3DIS dataset for large-scale semantic scene segmentation, the Point Transformer attains an mIoU of 70.4\% on Area 5, outperforming the strongest prior model by 3.3 absolute percentage points and crossing the 70\% mIoU threshold for the first time.},
archiveprefix = {arXiv},
langid = {american},
keywords = {Computer Science - Computer Vision and Pattern Recognition},
file = {C\:\\Users\\Dustella\\Zotero\\storage\\QCKE5JKN\\Zhao et al. - 2021 - Point Transformer.pdf;C\:\\Users\\Dustella\\Zotero\\storage\\X7ND7BWL\\2012.html}
}
@article{parker2001rank,
title={Rank and response combination from confusion matrix data},
author={Parker, JR},
journal={Information fusion},
volume={2},
number={2},
pages={113--120},
year={2001},
publisher={Elsevier}
}
@book{shneiderman2010designing,
title={Designing the user interface: strategies for effective human-computer interaction},
author={Shneiderman, Ben and Plaisant, Catherine},
year={2010},
publisher={Pearson Education India}
}
@inproceedings{graham2003using,
title={Using curves to enhance parallel coordinate visualisations},
author={Graham, Martin and Kennedy, Jessie},
booktitle={Proceedings on Seventh International Conference on Information Visualization, 2003. IV 2003.},
pages={10--16},
year={2003},
organization={IEEE}
}
@article{yuan2009scattering,
title={Scattering points in parallel coordinates},
author={Yuan, Xiaoru and Guo, Peihong and Xiao, He and Zhou, Hong and Qu, Huamin},
journal={IEEE Transactions on Visualization and Computer Graphics},
volume={15},
number={6},
pages={1001--1008},
year={2009},
publisher={IEEE}
}
@inproceedings{andrews2006evaluating,
title={Evaluating information visualisations},
author={Andrews, Keith},
booktitle={Proceedings of the 2006 AVI workshop on BEyond time and errors: novel evaluation methods for information visualization},
pages={1--5},
year={2006}
}
@incollection{carpendale2008evaluating,
title={Evaluating information visualizations},
author={Carpendale, Sheelagh},
booktitle={Information visualization},
pages={19--45},
year={2008},
publisher={Springer}
}
@incollection{craig2015interactive,
title={Interactive animated mobile information visualisation},
author={Craig, Paul},
booktitle={SIGGRAPH Asia 2015 Mobile Graphics and Interactive Applications},
pages={1--6},
year={2015}
}
@inproceedings{craig2015pervasive,
title={Pervasive information visualization: toward an information visualization design methodology for multi-device co-located synchronous collaboration},
author={Craig, Paul and Huang, Xin and Chen, Huayue and Wang, Xi and Zhang, Shiyao},
booktitle={2015 IEEE International Conference on Computer and Information Technology; Ubiquitous Computing and Communications; Dependable, Autonomic and Secure Computing; Pervasive Intelligence and Computing},
pages={2232--2239},
year={2015},
organization={IEEE}
}
@inproceedings{craig2014animated,
title={Animated geo-temporal clusters for exploratory search in event data document collections},
author={Craig, Paul and Se{\"\i}ler, N{\'e}na Roa and Cervantes, Ana Delia Olvera},
booktitle={2014 18th International Conference on Information Visualisation},
pages={157--163},
year={2014},
organization={IEEE}
}
@inproceedings{craig2012vertical,
title={A vertical timeline visualization for the exploratory analysis of dialogue data},
author={Craig, Paul and Roa-Se{\"\i}ler, N{\'e}na},
booktitle={2012 16th International Conference on Information Visualisation},
pages={68--73},
year={2012},
organization={IEEE}
}
@article{gang2011advances,
title={Advances of Detection Technology for Antibiotics from Livestock and Poultry Breeding Wastewater},
author={Gang, Li and Zhiyong, Yan and Xiuyi, Tan and Junfeng, Chen},
journal={Journal of Green Science and Technology},
number={11},
pages={54},
year={2011}
}
@article{andersson2003persistence,
title={Persistence of antibiotic resistant bacteria},
author={Andersson, Dan I},
journal={Current opinion in microbiology},
volume={6},
number={5},
pages={452--456},
year={2003},
publisher={Elsevier}
}
@article{bjorkman1998virulence,
title={Virulence of antibiotic-resistant Salmonella typhimurium},
author={Bj{\"o}rkman, Johanna and Hughes, Diarmaid and Andersson, Dan I},
journal={Proceedings of the National Academy of Sciences},
volume={95},
number={7},
pages={3949--3953},
year={1998},
publisher={National Acad Sciences}
}
@misc{zhengGaussianADGaussianCentricEndtoEnd2024,
title = {{{GaussianAD}}: {{Gaussian-Centric End-to-End Autonomous Driving}}},
shorttitle = {{{GaussianAD}}},
author = {Zheng, Wenzhao and Wu, Junjie and Zheng, Yao and Zuo, Sicheng and Xie, Zixun and Yang, Longchao and Pan, Yong and Hao, Zhihui and Jia, Peng and Lang, Xianpeng and Zhang, Shanghang},
year = {2024},
month = dec,
number = {arXiv:2412.10371},
eprint = {2412.10371},
primaryclass = {cs},
publisher = {arXiv},
doi = {10.48550/arXiv.2412.10371},
urldate = {2024-12-18},
abstract = {Vision-based autonomous driving shows great potential due to its satisfactory performance and low costs. Most existing methods adopt dense representations (e.g., bird's eye view) or sparse representations (e.g., instance boxes) for decision-making, which suffer from the trade-off between comprehensiveness and efficiency. This paper explores a Gaussian-centric end-to-end autonomous driving (GaussianAD) framework and exploits 3D semantic Gaussians to extensively yet sparsely describe the scene. We initialize the scene with uniform 3D Gaussians and use surrounding-view images to progressively refine them to obtain the 3D Gaussian scene representation. We then use sparse convolutions to efficiently perform 3D perception (e.g., 3D detection, semantic map construction). We predict 3D flows for the Gaussians with dynamic semantics and plan the ego trajectory accordingly with an objective of future scene forecasting. Our GaussianAD can be trained in an end-to-end manner with optional perception labels when available. Extensive experiments on the widely used nuScenes dataset verify the effectiveness of our end-to-end GaussianAD on various tasks including motion planning, 3D occupancy prediction, and 4D occupancy forecasting. Code: https://github.com/wzzheng/GaussianAD.},
archiveprefix = {arXiv},
langid = {english},
keywords = {Computer Science - Artificial Intelligence,Computer Science - Computer Vision and Pattern Recognition,Computer Science - Machine Learning,Computer Science - Robotics},
file = {C:\Users\Dustella\Zotero\storage\G9T4HCVG\Zheng et al. - 2024 - GaussianAD Gaussian-Centric End-to-End Autonomous Driving.pdf}
}