
\documentclass[11pt,twocolumn]{article}

\usepackage[utf8]{inputenc}
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{tikz}
\usetikzlibrary{positioning}
\usepackage{layout}
\usepackage{graphicx}

\usepackage[left=2cm,right=2cm,top=2.5cm,bottom=2.5cm]{geometry}
\setlength{\columnsep}{0.5in}

\title{Detection of Foreign Objects on Railway Tracks: A Pilot Study with RandLA-Net}
\author{Hanwen Yu}
\date{April 2025}

\begin{document}


\maketitle

\begin{abstract}
This pilot study investigates the feasibility of using LiDAR sensors coupled with 3D point cloud segmentation for detecting foreign objects on railway tracks. We establish a baseline using the unmodified RandLA-Net architecture on a dataset of 1031 PLY files containing over 248 million points across 13 classes. The purpose of this pilot is to evaluate the baseline model's ability to identify rare foreign objects (boxes), which constitute only about 0.001\% of the points in our dataset. Our methodology employs random sampling at a 1/4 rate to manage computational cost while preserving important features through RandLA-Net's Local Feature Aggregation module. The study evaluates detection accuracy using precision and mIoU metrics, as well as computational efficiency metrics including inference time and memory consumption. This work serves as a performance benchmark for future research exploring more sophisticated architectures with attention mechanisms for improved foreign object detection on railway infrastructure.
\end{abstract}

\section{Introduction}

Railway safety represents a critical infrastructure concern with profound implications for both public safety and economic stability. Foreign objects on railway tracks pose a significant threat that can lead to derailments, infrastructure damage, and potentially catastrophic accidents resulting in loss of life and substantial financial consequences. Traditional detection methods rely heavily on manual inspections by maintenance crews, which are inherently limited by human factors including fatigue, attention span, and the sheer scale of railway networks that need monitoring.


\begin{figure}[h]
\centering
\includegraphics[width=\columnwidth]{fig/example.jpg}
\caption{Example of a railway environment with a foreign object present on the track. LiDAR point clouds capture the 3D structure of both the infrastructure and potential hazards.}\label{fig:example}
\end{figure}

The advancement of sensor technologies, particularly Light Detection and Ranging (LiDAR), offers promising solutions for automated detection systems. LiDAR sensors provide high-precision 3D point cloud data capable of capturing the geometric structure of railway environments with millimeter-level accuracy. However, processing and analyzing these vast, unstructured datasets presents significant computational challenges, especially when trying to identify small, irregular objects amidst complex railway geometries.


\begin{figure*}[t]
    \centering
    \begin{tikzpicture}[
        block/.style={rectangle, draw, fill=blue!20, text width=3.8cm, text centered, rounded corners, minimum height=1.5cm},
        subblock/.style={rectangle, draw, fill=blue!10, text width=3.5cm, text centered, minimum height=0.8cm},
        arrow/.style={thick,->,>=stealth},
        node distance=0.5cm
    ]
    % Main blocks
    \node[block] (acquisition) {1. \textbf{Data Acquisition} };
    \node[block, below=of acquisition] (preprocessing) {2. \textbf{Preprocessing}};
    \node[block, right=of preprocessing] (feature) {3. \textbf{Feature Extraction}};
    \node[block, right=of feature] (segmentation) {4. \textbf{Segmentation}};
    \node[block, above=of segmentation] (postprocessing) {5. \textbf{Post-processing}};

    % Sub-blocks for preprocessing
    \node[subblock, below right=1.9cm and -1.9cm of preprocessing.north] (p1) {Point cloud normalization};
    \node[subblock, below=0.2cm of p1] (p2) {Subsampling (e.g., random sampling)};
    \node[subblock, below=0.2cm of p2] (p3) {Data augmentation };

    % Sub-blocks for feature extraction
    \node[subblock, below right=1.9cm and -1.9cm of feature.north] (f1) {Local geometric encoding (e.g., RandLA-Net's LocSE)};
    \node[subblock, below=0.2cm of f1] (f2) {Context-aware feature learning};


    % Sub-blocks for segmentation
    \node[subblock, below right=1.9cm and -1.9cm of segmentation.north] (s1) {Point-wise classification with class-balanced loss functions};
    \node[subblock, below=0.2cm of s1] (s2) {Confidence scoring for detection reliability};


    % Arrows connecting main blocks
    \draw[arrow] (acquisition) -- (preprocessing);
    \draw[arrow] (preprocessing) -- (feature);
    \draw[arrow] (feature) -- (segmentation);
    \draw[arrow] (segmentation) -- (postprocessing);

    \end{tikzpicture}
    \caption{Universal pipeline for 3D point cloud segmentation with emphasis on elements critical for rare object detection.}\label{fig:segmentation_pipeline}
\end{figure*}

Recent developments in deep learning approaches for 3D point cloud segmentation, such as PointNet~\cite{qiPointNetDeepLearning2017} and RandLA-Net~\cite{huRandLANetEfficientSemantic2020}, have demonstrated remarkable capabilities in processing unstructured point cloud data. These architectures could transform railway safety monitoring by enabling real-time, accurate detection of foreign objects on tracks.


This pilot study aims to establish a baseline performance benchmark using RandLA-Net for foreign object detection on railway tracks. We specifically address the challenge of detecting rare objects (boxes) that constitute merely 0.001\% of our dataset, a scenario that mirrors real-world conditions where foreign objects represent an extremely small portion of the railway environment. Through this pilot, we seek to evaluate:

\begin{itemize}
    \item The feasibility of using unmodified RandLA-Net architecture for detecting small, rare objects in complex railway environments
    % \item The computational efficiency of the approach, including processing time and memory requirements
    \item The detection accuracy metrics that will inform future research directions
\end{itemize}

The insights gained from this pilot study will guide the development of more sophisticated architectures incorporating attention mechanisms and specialized data augmentation techniques to enhance detection accuracy.


\section{Problem Statement}

This pilot study addresses the challenge of detecting foreign objects on railway tracks using 3D point cloud segmentation. The problem can be formally defined as follows:

Given a point cloud $P = \{p_1, p_2, \ldots, p_n\}$ where each point $p_i \in \mathbb{R}^3$ represents a 3D coordinate in the railway environment, our task is to assign each point a semantic label $l_i \in \{0, 1, \ldots, C-1\}$ where $C = 13$ represents our predefined classes. The function $f: P \rightarrow L$ maps the input point cloud to a set of labels $L = \{l_1, l_2, \ldots, l_n\}$.
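
As a sketch of how such a mapping is usually realized (an assumption about the general setup, not a detail stated in this study), a segmentation network outputs a score $s_{i,c}$ for every point $p_i$ and class $c$, and the predicted label is the highest-scoring class:

\begin{equation}
\hat{l}_i = \operatorname*{arg\,max}_{c \in \{0, \ldots, C-1\}} s_{i,c}
\end{equation}

Here $s_{i,c}$ and $\hat{l}_i$ are illustrative symbols introduced only for this sketch.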

Semantic segmentation, in which every point in the cloud is classified, is the most appropriate approach for three reasons. First, it maintains the precise spatial relationships between objects and their environment, which is essential for distinguishing legitimate infrastructure from hazardous foreign objects. Second, it captures the exact shape and position of objects relative to critical infrastructure components such as rails, enabling more accurate hazard assessment. Third, it classifies all environmental elements (tracks, ground, tunnels, etc.) simultaneously, providing comprehensive scene understanding rather than binary foreign/non-foreign detection.

Based on our findings and the broader literature, we identify a general pipeline for 3D point cloud segmentation that highlights where adjustments are needed for rare object detection (Figure~\ref{fig:segmentation_pipeline}).


\section{Related Work}

The semantic segmentation of 3D point cloud data has seen significant advancements in recent years. PointNet~\cite{qiPointNetDeepLearning2017} pioneered direct processing of raw point clouds by using shared MLPs and symmetric functions to achieve permutation invariance, though it struggled to capture local geometric structures. To address this limitation, hierarchical architectures emerged, with KPConv~\cite{thomasKPConvFlexibleDeformable2019} introducing kernel point convolutions that flexibly apply weights in Euclidean space.

Point cloud processing for large-scale environments presents unique computational challenges. The increasing volume of LiDAR data in outdoor environments, particularly for applications like autonomous driving and railway monitoring, has sparked interest in methods that balance accuracy with efficiency. PolarNet~\cite{zhangPolarNetImprovedGrid2020} introduced an improved grid representation specifically for online LiDAR segmentation that balances point distribution in polar coordinates. Meanwhile, attention-based approaches like Point Transformer~\cite{zhaoPointTransformer2021} have demonstrated state-of-the-art performance by adapting self-attention mechanisms to point cloud data, with subsequent iterations~\cite{wuPointTransformerV22022, wuPointTransformerV32024} focusing on improving efficiency and scalability.

For extremely large point clouds, RandLA-Net~\cite{huRandLANetEfficientSemantic2020} represents a significant breakthrough by employing random point sampling instead of computationally expensive alternatives like farthest point sampling. To compensate for the information potentially lost through random sampling, it introduces a Local Feature Aggregation module that preserves geometric details while maintaining computational efficiency. Recent research has also explored data augmentation strategies~\cite{parkRethinkingDataAugmentation2024} to improve model robustness in adverse conditions and label-efficient approaches~\cite{xieAnnotatorGenericActive2023} to reduce annotation costs.

Our work builds on RandLA-Net's efficient architecture as a baseline for evaluating foreign object detection capability in railway environments, particularly focusing on its performance with extremely imbalanced classes where target objects constitute a tiny fraction of the overall point cloud.

\section{Methodology}
\subsection{Data Acquisition and Characteristics}

For this study, we deployed multiple LiDAR sensors along selected railway segments to capture comprehensive 3D point cloud data. The resulting dataset comprises 1031 PLY files containing over 248 million points in total, with an average of 240,742.8 points per file. The number of points per file varies considerably, from 50,048 in the smallest file to 952,476 in the largest, reflecting the natural variability in scene complexity across different railway segments.

\begin{table*}[t]
    \centering
    \caption{Distribution of semantic classes in the railway LiDAR dataset}\label{tab:label_distribution}
    \begin{tabular}{clrr}
    \hline
    Label & Class Name & Point Count & Percentage \\
    \hline
    0 & Track & 16,653,029 & 6.71\% \\
    1 & Track Surface & 39,975,480 & 16.11\% \\
    2 & Ditch & 7,937,154 & 3.20\% \\
    3 & Masts & 4,596,199 & 1.85\% \\
    4 & Cable & 2,562,683 & 1.03\% \\
    5 & Tunnel & 31,412,582 & 12.66\% \\
    6 & Ground & 73,861,934 & 29.76\% \\
    7 & Fence & 7,834,499 & 3.16\% \\
    8 & Mountain & 51,685,366 & 20.82\% \\
    9 & Train & 9,047,963 & 3.65\% \\
    10 & Human & 275,077 & 0.11\% \\
    11 & Box (foreign object) & 3,080 & 0.001\% \\
    12 & Others & 2,360,810 & 0.95\% \\
    \hline
    \textbf{Total} & & \textbf{248,205,859} & \textbf{100\%} \\
    \hline
\end{tabular}
    \end{table*}

The dataset was manually annotated with 13 semantic classes representing common elements in railway environments. These classes include infrastructure components (track, ditch, masts, cable, tunnel, fence), environmental features (ground, mountain), dynamic objects (train, human), and critically, our target class—boxes—which represents foreign objects on the tracks. The ``others'' class serves as a background category for points that do not fit into the defined classes.

Table~\ref{tab:label_distribution} shows the distribution of semantic classes in our dataset. The class distribution exhibits extreme imbalance, which mirrors the real-world scenario where foreign objects are rare occurrences. Specifically, the target ``box'' class (label 11) constitutes merely 3,080 points, or approximately 0.001\% of the entire dataset. In contrast, common environmental elements like ground (label 6) and mountain (label 8) make up 29.76\% and 20.82\% of the points, respectively. This severe imbalance presents a significant challenge for the detection task, as the model must learn to identify extremely rare objects without being overwhelmed by the dominant classes.
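
To make the imbalance concrete, the share of box points follows directly from the counts in Table~\ref{tab:label_distribution}:

\begin{equation}
\frac{3{,}080}{248{,}205{,}859} \approx 1.24 \times 10^{-5} \approx 0.001\%
\end{equation}

By the same counts, ground points outnumber box points by roughly $73{,}861{,}934 / 3{,}080 \approx 2.4 \times 10^{4}$ to one.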


The dataset was split into a training set of 858 files (approximately 83\%) and a test set of 172 files (approximately 17\%). The split was performed randomly so that the model would be evaluated on a representative and unbiased sample of the data, covering the full range of environmental conditions and object distributions present in the dataset. After this split, 18 files in the training set contain foreign objects (label 11), while only 1 file in the test set does.

\subsection{Dataset Preprocessing}

The primary challenge of data preprocessing was managing the computational burden of the massive point clouds while preserving the geometric information critical for segmentation, particularly for the rare foreign object points.

We used random sampling as our primary downsampling method to reduce point cloud density, following the approach employed in RandLA-Net. In our implementation, each point cloud was downsampled at a ratio of 1/4, meaning we retained 25\% of the original points for training and testing. This approach offers significant computational advantages over more complex sampling techniques such as farthest point sampling, which has a computational complexity of $O(n^2)$ compared to the $O(n)$ complexity of random sampling.

Although random sampling risks discarding informative points by chance, particularly from the already sparse foreign object class, RandLA-Net compensates for this potential loss through its Local Feature Aggregation (LFA) module. This module progressively enlarges the receptive field of each point, effectively capturing local geometric patterns despite the aggressive downsampling.
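
The size of this risk can be estimated with a simple model. As an illustrative assumption, treat each point as retained independently with probability $1/4$, which approximates uniform sampling without replacement when the cloud is large. For an object represented by $m$ points (a hypothetical symbol), the expected number of surviving points and the probability that none survive are then

\begin{equation}
\mathbb{E}[m_{\text{kept}}] = \frac{m}{4}, \qquad \Pr(m_{\text{kept}} = 0) \approx \left(\frac{3}{4}\right)^{m}
\end{equation}

Under this model, the 3,080 box points in the full dataset reduce to about 770 in expectation, and an object with only $m = 10$ points would vanish entirely with probability $(3/4)^{10} \approx 0.056$.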

\textbf{No additional data augmentation techniques were applied} during this pilot study, as our focus was on establishing a baseline performance using the standard RandLA-Net architecture without modifications. This approach allows us to isolate the inherent capabilities of the model on our railway dataset before exploring potential improvements through data augmentation or architectural changes in future work.

\subsection{Implementation Details}

% add a image
\begin{figure*}[t]
\centering
\includegraphics[width=\textwidth]{fig/Fig7.pdf}
\caption{Architecture of RandLA-Net. The network consists of an encoder-decoder structure with local feature aggregation modules to capture local geometric patterns.}\label{fig:architecture}
\end{figure*}

For this pilot study, we implemented RandLA-Net by following the architecture described in the original paper~\cite{huRandLANetEfficientSemantic2020}. The network follows an encoder-decoder design with skip connections, consisting of four encoding layers and four decoding layers. Each encoding layer contains a local feature aggregation module followed by random sampling, progressively reducing point density while increasing feature dimensions to preserve important information. The overall architecture is shown in Figure~\ref{fig:architecture}.

% The Local Feature Aggregation (LFA) module consists of three essential units: (1) Local Spatial Encoding (LocSE), which explicitly encodes the relative positions of neighboring points to capture local geometric patterns; (2) Attentive Pooling, which employs an attention mechanism to automatically weight and combine neighboring features, focusing on the most informative ones; and (3) Dilated Residual Blocks, which stack multiple LocSE and Attentive Pooling units with skip connections to significantly enlarge the receptive field of each point. This architecture enables each point to effectively observe up to $K^2$ neighboring points after two stacked units (where $K$ is the number of nearest neighbors), allowing the network to capture complex local structures even when many points are dropped during random sampling.

The network hyperparameters were kept at their default values as specified in the original implementation. In particular, the number of nearest neighbors $K$ was set to 16 for the K-nearest neighbors (KNN) algorithm used in local spatial encoding. For each dilated residual block, we followed the original design of stacking two sets of local spatial encoding (LocSE) units and attentive pooling units, which provides an effective balance between accuracy and computational efficiency by expanding the receptive field to cover approximately $K^2$ points.
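
For reference, the original RandLA-Net paper~\cite{huRandLANetEfficientSemantic2020} formulates the two units roughly as follows; the notation is borrowed from that paper and is not otherwise used in this report. LocSE encodes the position of each of the $K$ neighbors $p_i^k$ relative to the center point $p_i$, where $\oplus$ denotes concatenation:

\begin{equation}
\begin{split}
\mathbf{r}_i^k = \mathrm{MLP}\bigl( & p_i \oplus p_i^k \oplus (p_i - p_i^k) \\
& \oplus \lVert p_i - p_i^k \rVert \bigr)
\end{split}
\end{equation}

The encoding $\mathbf{r}_i^k$ is concatenated with the neighbor's feature to give $\hat{\mathbf{f}}_i^k$. Attentive pooling then computes attention scores with a shared function $g$ (a shared MLP followed by softmax) with learnable weights $W$, and sums the weighted features:

\begin{equation}
\mathbf{s}_i^k = g(\hat{\mathbf{f}}_i^k, W), \qquad \tilde{\mathbf{f}}_i = \sum_{k=1}^{K} \hat{\mathbf{f}}_i^k \cdot \mathbf{s}_i^k
\end{equation}

With $K = 16$ and two stacked units, the nominal receptive field is $K^2 = 256$ points.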

The point cloud sampling strategy employed a simple random sampling approach with a four-fold decimation ratio at each layer, meaning only 25\% of points were retained after each encoding layer. This aggressive downsampling is compensated for by the Local Feature Aggregation module.
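
The effect of repeated decimation on point count is easy to quantify. Writing $N_0$ for the number of points entering the encoder and $N_j$ for the number remaining after encoding layer $j$ (illustrative symbols), a fixed retention ratio of $1/4$ gives

\begin{equation}
N_j = \frac{N_0}{4^{j}}, \qquad N_4 = \frac{N_0}{256}
\end{equation}

As a hypothetical example, an input of $N_0 = 60{,}000$ points would leave roughly 234 points at the bottleneck after four encoding layers.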

For model training, we used the Adam optimizer with an initial learning rate of 0.01, decreasing by 5\% after each epoch. Despite the extreme class imbalance in our dataset, we did not employ any reweighting strategy for the loss function in this baseline study, as our focus was on establishing fundamental performance metrics without modifications to the architecture or training process.
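
The stated schedule corresponds to exponential decay. Writing $\eta_e$ for the learning rate in epoch $e$, counted from zero (an illustrative convention),

\begin{equation}
\eta_e = 0.01 \times 0.95^{\,e}
\end{equation}

so that, for instance, the rate would fall to $\eta_{50} \approx 7.7 \times 10^{-4}$ after 50 epochs. The figure of 50 epochs is a hypothetical value chosen only to illustrate the decay.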

During inference, the entire point cloud was processed directly without any pre- or post-processing steps such as block partitioning or voxelization, demonstrating the ability of RandLA-Net to handle large-scale point clouds efficiently. All experiments were conducted on an NVIDIA RTX 4090 GPU with 24GB of memory.

\subsection{Evaluation Metrics}

To evaluate the performance of RandLA-Net on our railway foreign object detection task, we employed two primary metrics that are particularly relevant for semantic segmentation with extreme class imbalance:

\textbf{Mean Intersection over Union (mIoU):} This is our primary evaluation metric for semantic segmentation accuracy. For each class $c$, the IoU is calculated as:

\begin{equation}
\text{IoU}_c = \frac{\text{TP}_c}{\text{TP}_c + \text{FP}_c + \text{FN}_c}
\end{equation}

where $\text{TP}_c$, $\text{FP}_c$, and $\text{FN}_c$ represent true positives, false positives, and false negatives for class $c$, respectively. The mIoU is then calculated by averaging the IoU values across all classes:

\begin{equation}
\text{mIoU} = \frac{1}{C} \sum_{c=0}^{C-1} \text{IoU}_c
\end{equation}

where $C$ is the number of classes (13 in our case). This metric is particularly valuable for our task as it treats all classes equally regardless of their frequency in the dataset, giving appropriate weight to the rare foreign object class.

\textbf{Precision:} To specifically evaluate the model's ability to detect foreign objects (boxes), we also calculated precision for this class. Precision measures the proportion of correctly identified foreign object points among all points classified as foreign objects:

\begin{equation}
\text{Precision}_{box} = \frac{\text{TP}_{box}}{\text{TP}_{box} + \text{FP}_{box}}
\end{equation}

This metric is crucial for railway safety applications, as false positives could lead to unnecessary system interventions and operational disruptions. High precision indicates that when the model identifies a point as belonging to a foreign object, it is likely correct.
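
Precision is best read alongside recall, which is given here only for context:

\begin{equation}
\text{Recall}_{box} = \frac{\text{TP}_{box}}{\text{TP}_{box} + \text{FN}_{box}}
\end{equation}

Two edge cases follow from the definitions. If a model never predicts the box class, then $\text{TP}_{box} = \text{FP}_{box} = 0$ and precision takes the undefined form $0/0$, while recall is zero whenever box points exist in the ground truth. Likewise, $\text{IoU}_{box} = 0$ holds exactly when $\text{TP}_{box} = 0$ (provided the denominator is nonzero), regardless of how many false positives occur.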

% In addition to these primary metrics, we also monitored computational efficiency through:

% \begin{itemize}
%     \item \textbf{Inference Time:} The average processing time per point cloud, measured in seconds.
%     \item \textbf{Memory Consumption:} The peak GPU memory usage during both training and inference.
% \end{itemize}

% These efficiency metrics are important for assessing the practical deployability of the system in real-world railway monitoring scenarios where real-time processing is essential.

\section{Results}

Our evaluation of RandLA-Net's performance on railway point cloud segmentation with a focus on foreign object detection yielded informative baseline results. Table~\ref{tab:segmentation_results} presents the IoU values achieved for each semantic class in our test set, along with the overall mean IoU.


The overall mean IoU across all classes was 70.29\%, which indicates generally good performance for most classes. However, the IoU for our target class, ``Box'' (foreign object), was 0.00\%, highlighting a critical limitation of the baseline model in detecting extremely rare objects. This failure is most plausibly attributable to the extreme class imbalance, with foreign objects constituting only 0.001\% of the dataset.
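
The cost of the missed class to the aggregate score can be read off the per-class values in Table~\ref{tab:segmentation_results}, whose thirteen IoU entries sum to 913.80. Writing $\text{mIoU}_{13}$ for the mean over all classes and $\text{mIoU}_{12}$ for the mean over the twelve classes other than box (illustrative subscripts):

\begin{gather}
\text{mIoU}_{13} = \frac{913.80}{13} \approx 70.29\% \\
\text{mIoU}_{12} = \frac{913.80}{12} \approx 76.15\%
\end{gather}

The single zero entry therefore lowers the mIoU by about 5.9 percentage points.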

Overall point-wise accuracy was 88.86\%, suggesting that the model performed well on the majority classes that dominate the point distribution. However, this metric masks the inability to detect the rare foreign object class, demonstrating why accuracy alone is insufficient for evaluating models on imbalanced datasets.
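
A short bound illustrates the point. If a class makes up a fraction $\rho$ of all points (an illustrative symbol), a model that never predicts that class misclassifies at most those points on its account, so its overall accuracy can still be as high as

\begin{equation}
\text{Acc}_{\max} = 1 - \rho
\end{equation}

Taking the dataset-wide box share $\rho \approx 1.24 \times 10^{-5}$ as a rough guide (the test-set share may differ), ignoring boxes altogether costs at most about 0.001 percentage points of accuracy.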


\begin{table}[t]
    \centering
    \caption{IoU results for each class achieved by RandLA-Net on the railway dataset test set}\label{tab:segmentation_results}
    \begin{tabular}{clc}
    \hline
    Label & Class Name & IoU (\%) \\
    \hline
    0 & Track & 60.12 \\
    1 & Track Surface & 74.53 \\
    2 & Ditch & 74.21 \\
    3 & Masts & 82.48 \\
    4 & Cable & 73.62 \\
    5 & Tunnel & 83.03 \\
    6 & Ground & 89.68 \\
    7 & Fence & 79.81 \\
    8 & Mountain & 91.93 \\
    9 & Train & 95.22 \\
    10 & Human & 61.86 \\
    11 & Box (foreign object) & 0.00 \\
    12 & Others & 47.31 \\
    \hline
    \multicolumn{2}{l}{Mean IoU} & 70.29 \\
    \hline
\end{tabular}
\end{table}

For computational efficiency, the model processed the entire test set of 172 point cloud files in approximately 1 minute on our NVIDIA RTX 4090 GPU, averaging 0.4 seconds per point cloud. Peak GPU memory consumption was 7.8 GB during inference. These efficiency metrics demonstrate that RandLA-Net's computational performance is suitable for potential real-time applications in railway monitoring, despite the detection limitations.


Figure~\ref{fig:segmentation_visualization} shows qualitative results of the segmentation performance. As the visualization shows, RandLA-Net accurately segments major classes like tracks, ground, and tunnels.

These results establish a clear baseline that highlights both the strengths of RandLA-Net in efficiently processing large-scale point clouds and its limitations in handling extreme class imbalance without specialized techniques. Notably, this confirms our hypothesis that a basic implementation of RandLA-Net without modifications or class-balancing strategies is insufficient for the safety-critical task of foreign object detection on railway tracks.

\begin{figure}[h]
\centering
\includegraphics[width=\columnwidth]{fig/visualization.jpg}
\caption{Visualization of RandLA-Net segmentation results on a sample from the test set. Left: raw point cloud with RGB coloring. Right: segmentation results with different colors representing different semantic classes. Foreign objects (red circles) were not correctly identified.}\label{fig:segmentation_visualization}
\end{figure}

\section{Discussion}


The complete failure to detect foreign objects in this pilot study has significant implications for railway safety systems. In safety-critical railway operations, a single undetected foreign object can cause derailments with potentially catastrophic consequences. Therefore, despite the good overall segmentation accuracy, a system that fails to detect the most critical objects is unsuitable for deployment. This underscores the need for specialized architectures designed specifically for the detection of rare but critically important objects in railway environments.


\section{Future Work and Conclusion}

Based on our pilot study, several promising research directions emerge for improving foreign object detection on railway tracks:

\begin{itemize}
    \item Developing class-balanced sampling strategies to ensure adequate representation of rare objects
    \item Adapting attention mechanisms to give greater weight to points belonging to smaller objects
    \item Employing specialized loss functions for handling extreme class imbalance
    \item Incorporating domain-specific knowledge about railway environments
\end{itemize}
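
As one illustration of the loss-function direction (a sketch of commonly used formulations, not a method evaluated in this study), a class-weighted cross-entropy over $N$ points can be written as

\begin{equation}
\mathcal{L}_{\text{wCE}} = -\frac{1}{N} \sum_{i=1}^{N} w_{l_i} \log \hat{y}_{i, l_i}
\end{equation}

where $\hat{y}_{i,c}$ is the predicted probability of class $c$ for point $p_i$ and $w_c$ is a per-class weight, for example inversely proportional to class frequency. The focal loss instead down-weights points that are already classified confidently:

\begin{equation}
\mathcal{L}_{\text{focal}} = -\frac{1}{N} \sum_{i=1}^{N} (1 - \hat{y}_{i, l_i})^{\gamma} \log \hat{y}_{i, l_i}
\end{equation}

with focusing parameter $\gamma \geq 0$; setting $\gamma = 0$ recovers the unweighted cross-entropy. The symbols $w_c$, $\hat{y}_{i,c}$, and $\gamma$ are introduced here for illustration only.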

Our baseline evaluation demonstrates that while RandLA-Net offers impressive computational efficiency for large-scale point cloud processing, its unmodified architecture is fundamentally unsuitable for the safety-critical task of detecting rare foreign objects on railway tracks. Future work should focus on addressing the identified limitations while maintaining the computational advantages that make real-time processing feasible in railway monitoring applications.

\section{Legal and Ethical Considerations}

The deployment of automated detection systems in railway infrastructure raises several important considerations. From a legal perspective, the poor detection performance highlights concerns about reliability and accountability—determining liability in failure-induced accidents becomes complex. The lack of transparency in deep neural networks raises ethical concerns about explainability, particularly important in safety-critical systems where operators and regulators need to understand system decisions.

Data privacy is another concern, as LiDAR data collection along railway corridors may capture information beyond the tracks, including people and private property. Additionally, there are currently no standardized testing or certification protocols for AI-based railway safety systems, and our findings suggest that conventional metrics may be insufficient for evaluation.

From a social perspective, automated systems may affect the railway maintenance workforce, potentially reducing manual inspection needs. While this could improve worker safety, it raises concerns about job displacement. Public trust in railway safety could also be affected by AI system deployment, with failures potentially undermining confidence in both the technology and railway operators.

\bibliographystyle{plain} % We choose the ``plain'' reference style
\bibliography{refs} % Entries are in the refs.bib file


\end{document}