From Raw Sensor Data to On-Vehicle Inference: Building a 3D Object Detection Pipeline
This semester project gave me hands-on experience across the full machine learning lifecycle for autonomous driving perception: from multi-sensor data collection and calibration to pseudo-label generation, monocular 3D object detection training, evaluation, and deployment-oriented model export.
Project context
The project was carried out in the context of the EDGAR autonomous driving platform at the Technical University of Munich. The vehicle recorded synchronized data from six RGB cameras, a roof-mounted LiDAR, and radar across real driving scenarios in Munich. The main goal was to investigate how multi-sensor information could be used to reduce manual labeling effort and improve camera-based 3D object detection.
More specifically, the project focused on transferring geometric information from LiDAR into the camera domain, generating pseudo-labels for training, and evaluating whether automatically generated labels could serve as a scalable alternative to expensive human annotation.
Why this problem matters
3D object detection is a core capability in autonomous driving because it enables a vehicle to estimate the position, size, and orientation of surrounding objects such as cars, pedestrians, and cyclists. While LiDAR-based perception is highly accurate, it is expensive and not always available in deployment scenarios. Monocular camera-based detection is more scalable, but it is significantly harder because depth must be inferred from RGB images alone.
This creates a practical challenge: how can we build robust 3D perception systems without relying entirely on densely annotated LiDAR datasets?
Our project explored one answer: use multi-sensor data during development to automatically generate pseudo-labels, then train monocular detection models that are cheaper to deploy.
1. Data collection in real-world urban environments
A major part of the project began long before model training. We collected multi-sensor driving data in Munich using the EDGAR research vehicle. The recordings covered a variety of urban conditions, including busy intersections, pedestrian-heavy zones, tunnels, multilane streets, and changing lighting and weather.
The sensor and recording setup included:
- Six surround-view RGB cameras operating at 15 Hz
- A Velodyne HDL-32E LiDAR operating at 20 Hz
- Radar for complementary motion-aware perception
- ROS 2 for synchronized sensor recording and playback
Working with real vehicle data immediately highlighted an important lesson: high-quality ML systems depend heavily on disciplined data operations. Before any modeling could begin, the raw sensor streams had to be recorded reliably, organized correctly, and prepared for downstream fusion and training.
2. Calibration and synchronization as the foundation of fusion
For any multi-sensor perception pipeline, calibration is not a preprocessing detail; it is foundational infrastructure.
To enable projection between LiDAR and image space, we calibrated the camera system using checkerboard-based intrinsic and extrinsic calibration. This step produced the parameters required to map 3D points from the LiDAR frame into 2D camera coordinates.
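As a concrete sketch, the projection step can be written in a few lines of NumPy; the extrinsic transform and intrinsic matrix in the example below are illustrative placeholders, not the actual EDGAR calibration:

```python
import numpy as np

def project_lidar_to_image(points_lidar, T_cam_from_lidar, K):
    """Project Nx3 LiDAR points into pixel coordinates.

    points_lidar: (N, 3) points in the LiDAR frame
    T_cam_from_lidar: (4, 4) extrinsic transform LiDAR -> camera
    K: (3, 3) camera intrinsic matrix
    Returns (M, 2) pixel coordinates for points in front of the camera,
    plus the boolean mask selecting those points.
    """
    n = points_lidar.shape[0]
    pts_h = np.hstack([points_lidar, np.ones((n, 1))])   # homogeneous (N, 4)
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]      # into camera frame
    in_front = pts_cam[:, 2] > 0.1                       # drop points behind the image plane
    uvw = (K @ pts_cam[in_front].T).T                    # pinhole projection
    uv = uvw[:, :2] / uvw[:, 2:3]                        # normalize by depth
    return uv, in_front
```

With an identity extrinsic and a typical intrinsic matrix, a point ten meters straight ahead lands at the principal point, which makes the function easy to sanity-check.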
In practice, the core challenges were:
- Estimating reliable intrinsic camera parameters
- Recovering accurate extrinsic transforms between sensors
- Handling synchronization mismatches between 15 Hz cameras and 20 Hz LiDAR
- Minimizing projection errors caused by ego motion and temporal offsets
These issues mattered directly because even small alignment errors can degrade label transfer quality and hurt downstream model training.
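The synchronization mismatch between the 15 Hz cameras and the 20 Hz LiDAR, in particular, comes down to nearest-timestamp matching with a rejection tolerance. The sketch below is a simplified, hypothetical version of such a matcher; the tolerance value is illustrative, not the one used on the vehicle:

```python
import bisect

def match_nearest(cam_stamps, lidar_stamps, max_offset_s=0.025):
    """For each camera timestamp, find the nearest LiDAR timestamp.

    Both lists must be sorted ascending (seconds). Pairs whose offset
    exceeds max_offset_s are dropped rather than force-matched, so a
    missing LiDAR sweep cannot silently corrupt a training pair.
    Returns a list of (cam_t, lidar_t) tuples.
    """
    pairs = []
    for t in cam_stamps:
        i = bisect.bisect_left(lidar_stamps, t)
        candidates = []
        if i < len(lidar_stamps):
            candidates.append(lidar_stamps[i])   # first stamp at or after t
        if i > 0:
            candidates.append(lidar_stamps[i - 1])  # last stamp before t
        if not candidates:
            continue
        best = min(candidates, key=lambda s: abs(s - t))
        if abs(best - t) <= max_offset_s:
            pairs.append((t, best))
    return pairs
```

Dropping out-of-tolerance pairs trades a little data for much cleaner geometry, which matters because residual temporal offsets show up directly as projection error.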
3. Data fusion across camera, LiDAR, and radar
One of the most valuable parts of the project was working with multi-modal sensor data rather than treating perception as an image-only problem.
Each modality contributed different strengths:
- Cameras provided rich semantic and texture information
- LiDAR provided accurate 3D geometry and depth
- Radar contributed robustness for motion-related cues and adverse sensing conditions
This project reinforced a key systems insight: sensor fusion is not only about combining inputs inside a neural network. It also happens at the data engineering level, where calibration, timestamp alignment, coordinate transforms, and dataset design determine whether a learning pipeline is trustworthy.
To standardize the data for experimentation, we converted ROS 2 bag recordings into the nuScenes format using a custom pipeline. This created a consistent structure for sensor files, poses, calibration metadata, and annotations, allowing us to use established tools for pseudo-labeling, evaluation, and training.
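To make the conversion concrete, here is a simplified sketch of the kind of records such a converter emits. The field names follow the nuScenes schema (`sample_data`, `calibrated_sensor`); all token values, paths, and numbers are placeholders:

```python
# Sketch of one nuScenes-style sample_data record and its calibration
# record, as a bag-to-nuScenes converter might emit them. Tokens,
# filenames, and calibration numbers are illustrative placeholders.
sample_data = {
    "token": "sd_000001",
    "sample_token": "sample_000001",
    "calibrated_sensor_token": "cs_cam_front",
    "timestamp": 1693000000123456,            # microseconds since epoch
    "fileformat": "jpg",
    "is_key_frame": True,
    "filename": "samples/CAM_FRONT/frame_000001.jpg",
}

calibrated_sensor = {
    "token": "cs_cam_front",
    "sensor_token": "sensor_cam_front",
    "translation": [1.7, 0.0, 1.5],           # meters, vehicle frame
    "rotation": [0.5, -0.5, 0.5, -0.5],       # quaternion (w, x, y, z)
    "camera_intrinsic": [[1000.0, 0.0, 960.0],
                         [0.0, 1000.0, 600.0],
                         [0.0, 0.0, 1.0]],
}
```

The value of this structure is the indirection: every frame links to its calibration by token, so downstream tools never have to guess which intrinsics belong to which image.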
4. Data cleaning, conversion, and feature preparation
After collection, the next challenge was turning raw logs into training-ready data.
This involved:
- Extracting synchronized sensor streams from ROS bags
- Cleaning and validating sensor outputs
- Organizing scene-level metadata
- Converting EDGAR data into nuScenes format for multi-sensor processing
- Converting selected camera-view data into KITTI format for monocular model training
- Preparing calibration files, labels, and split information for reproducible experiments
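One concrete piece of this preparation is writing KITTI label files, which store one object per line in a fixed field order. A minimal formatter might look like this; the example values in the test are illustrative:

```python
def kitti_label_line(obj):
    """Format one object as a KITTI label line:

    type truncated occluded alpha x1 y1 x2 y2 h w l x y z rotation_y
    """
    fields = [
        obj["type"],
        f"{obj['truncated']:.2f}",
        str(obj["occluded"]),
        f"{obj['alpha']:.2f}",
        *(f"{v:.2f}" for v in obj["bbox"]),        # 2D box: left top right bottom (px)
        *(f"{v:.2f}" for v in obj["dimensions"]),  # 3D size: height width length (m)
        *(f"{v:.2f}" for v in obj["location"]),    # 3D center in camera frame (m)
        f"{obj['rotation_y']:.2f}",
    ]
    return " ".join(fields)
```

Getting details like the height-width-length order and the camera-frame location convention right is exactly the kind of silent-failure surface that made this stage feel like real ML engineering.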
This stage resembled real ML engineering far more than textbook modeling. A large portion of the effort went into data quality, tooling, automation, and compatibility between research frameworks.
Although “feature engineering” in modern deep learning is often less manual than in classical ML, the project still demanded deliberate representation choices: selecting the front camera view for monocular training, generating geometry-consistent labels, and deciding how to transfer multi-sensor information into a form the model could learn from effectively.
5. Automated pseudo-label generation
The central technical idea of the project was to use pseudo-labeling to reduce the annotation burden.
We implemented an automated pipeline based on MS3D, a multi-source unsupervised domain adaptation framework for 3D object detection. Instead of relying on a single source-domain detector, MS3D aggregates predictions from multiple pre-trained 3D detectors and fuses them using KDE-based box fusion. It also leverages temporal consistency across multiple frames to improve label quality for static objects.
Our automated pipeline covered the full workflow:
- Convert raw EDGAR recordings into a nuScenes-compatible dataset
- Run ensemble-based pseudo-label generation on unlabeled target-domain data
- Refine labels through temporal fusion and post-processing
- Export pseudo-labels into structured annotation files
- Prepare the resulting dataset for downstream monocular training
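The ensemble-fusion idea can be illustrated with a heavily simplified sketch: pool box centers from several detectors, cluster nearby ones, and keep only well-supported clusters. This is not the actual MS3D KDE box fusion, just the underlying intuition (radius and support threshold are illustrative):

```python
import numpy as np

def fuse_ensemble_centers(detections, radius=1.0, min_support=2):
    """Greedy fusion of 3D box centers pooled from multiple detectors.

    detections: list of (center_xyz_tuple, score) pairs.
    Centers within `radius` meters of the highest-scoring remaining
    detection form a cluster, fused into a score-weighted mean center.
    Clusters with fewer than `min_support` members are dropped as
    likely single-detector false positives.
    """
    remaining = sorted(detections, key=lambda d: -d[1])  # high score first
    fused = []
    while remaining:
        seed_center = np.asarray(remaining[0][0])
        cluster = [d for d in remaining
                   if np.linalg.norm(np.asarray(d[0]) - seed_center) <= radius]
        remaining = [d for d in remaining if d not in cluster]
        if len(cluster) >= min_support:
            weights = np.array([score for _, score in cluster])
            centers = np.array([center for center, _ in cluster])
            fused.append((weights @ centers) / weights.sum())
    return fused
```

The support threshold is what buys label quality: a box only survives if several independent detectors agree on roughly the same location.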
This stage taught me an important practical lesson: scalable AI systems are often built around pipelines, not just models. Automation, configuration management, and reproducibility were essential for turning a research method into a usable workflow.
6. Model selection: why GUPNet
For monocular 3D object detection, we selected GUPNet as the primary model architecture.
GUPNet is especially interesting because it addresses the hardest part of monocular 3D detection: depth estimation under ambiguity. Instead of predicting depth as a single deterministic value, it models the geometric uncertainty of the estimate explicitly. This makes it a strong fit for camera-based 3D perception, where ambiguity is unavoidable.
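The core idea can be sketched as an uncertainty-weighted regression loss: the network predicts a scale (sigma) alongside each depth, and the loss trades error magnitude against claimed confidence. This is a simplified per-sample form in the spirit of GUPNet's depth loss, not its exact implementation:

```python
import numpy as np

def laplacian_uncertainty_loss(depth_pred, depth_gt, log_sigma):
    """Depth regression loss with learned aleatoric uncertainty.

    Compared with plain L1: confident predictions (small sigma) are
    penalized harder for the same error, while the +log_sigma term
    stops the network from inflating sigma everywhere to dodge the
    penalty. Predicting log_sigma keeps sigma positive by construction.
    """
    sigma = np.exp(log_sigma)
    return (np.sqrt(2.0) / sigma) * np.abs(depth_pred - depth_gt) + log_sigma
```

The same predicted sigma can then be reused downstream, for example to down-weight detections whose depth the network itself flags as unreliable.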
The model was adapted to our custom dataset through:
- Dataset conversion from nuScenes-style annotations to KITTI format
- Calibration-aware preprocessing
- Custom YAML-based experiment configuration
- Training and evaluation on both ground-truth and pseudo-labeled data
Choosing the model was therefore not only about benchmark performance. It was also about compatibility with available data, engineering effort, and the ability to customize the pipeline end to end.
7. Training, experimentation, and evaluation
With the dataset prepared, we trained multiple GUPNet variants and compared models trained on manually annotated ground truth labels against models trained on pseudo-labels.
This part of the project covered the most recognizable ML lifecycle steps:
- Configuring experiments
- Building data loaders and augmentation pipelines
- Training across multiple runs and hyperparameter settings
- Tracking metrics and checkpoints
- Comparing performance across object classes and label sources
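As an illustration of the evaluation logic, a minimal center-distance matcher in the style of the nuScenes metric might look like the sketch below. Real evaluation additionally sorts predictions by confidence across the whole dataset and integrates an AP curve; the 2 m threshold here is one of the standard nuScenes settings:

```python
import numpy as np

def match_precision_recall(pred_centers, gt_centers, dist_thresh=2.0):
    """Greedy center-distance matching for one frame.

    Each prediction may claim at most one ground-truth box within
    dist_thresh meters; matched GT boxes are removed so they cannot
    be counted twice. Returns (precision, recall).
    """
    unmatched_gt = list(range(len(gt_centers)))
    tp = 0
    for p in pred_centers:
        dists = [np.linalg.norm(np.asarray(p) - np.asarray(gt_centers[g]))
                 for g in unmatched_gt]
        if dists and min(dists) <= dist_thresh:
            unmatched_gt.pop(int(np.argmin(dists)))
            tp += 1
    precision = tp / len(pred_centers) if pred_centers else 0.0
    recall = tp / len(gt_centers) if gt_centers else 0.0
    return precision, recall
```

Center-distance matching is deliberately forgiving about box extent, which suits monocular detectors whose dimension estimates are often better than their depth.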
The evaluation focused on pedestrian, car, and bicycle detection. The main result was clear: models trained on ground truth labels substantially outperformed models trained only on pseudo-labels.
The pseudo-labeling approach was promising from a scalability perspective, but its effectiveness was limited in practice by several factors:
- Label noise in automatically generated annotations
- Domain mismatch between source-domain detectors and EDGAR data
- The inherent difficulty of monocular 3D detection, especially for small or distant objects
Even so, the outcome was highly valuable. Negative or mixed results are still meaningful in applied ML because they reveal where system bottlenecks actually are. In this case, the project showed that pseudo-labeling can support dataset scaling, but only if label quality is controlled through stronger filtering, iterative refinement, or hybrid supervision.
8. Deployment-oriented export and vehicle integration
The final stage of the project moved beyond research experimentation toward deployment.
To make the trained detector easier to integrate into the vehicle software stack, we exported the final model to ONNX. This made the model more portable for runtime environments such as ROS 2-based systems and enabled efficient inference with ONNX Runtime.
This step was especially valuable because it connected model development to real deployment constraints:
- Input and output interfaces had to be clearly defined
- Preprocessing and post-processing needed to be wrapped consistently
- Numerical behavior had to be checked after export
- The model had to be suitable for integration into a larger perception stack
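The numerical check in particular can be sketched as a head-by-head tolerance comparison between the original model's outputs and the ONNX Runtime outputs. The function below assumes both are available as dictionaries of arrays; the tolerances are illustrative starting points, not values tuned for this model:

```python
import numpy as np

def check_export_parity(ref_outputs, onnx_outputs, atol=1e-4, rtol=1e-3):
    """Compare reference (e.g. PyTorch) and exported (e.g. ONNX Runtime)
    outputs head by head, returning the worst absolute deviation.

    Small differences are expected from operator reordering and fused
    kernels; large ones usually point at unsupported ops, dtype drift,
    or mismatched preprocessing on either side of the export.
    """
    worst = 0.0
    for name in ref_outputs:
        a = np.asarray(ref_outputs[name])
        b = np.asarray(onnx_outputs[name])
        assert a.shape == b.shape, f"{name}: shape mismatch {a.shape} vs {b.shape}"
        worst = max(worst, float(np.max(np.abs(a - b))))
        if not np.allclose(a, b, atol=atol, rtol=rtol):
            raise ValueError(f"{name}: outputs diverge beyond tolerance")
    return worst
```

Running this on a handful of held-out frames right after export catches most conversion problems before the model ever reaches the vehicle stack.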
For me, this closed the loop of the ML lifecycle. The project was not only about training a model that works offline; it was about preparing a perception component that could realistically be used in an autonomous vehicle workflow.
Key technical lessons
Looking back, the most important lessons from this project were not limited to deep learning alone.
1. Data quality is a first-class concern
Model performance is heavily constrained by sensor quality, calibration accuracy, label reliability, and dataset consistency.
2. ML systems are built as pipelines
The success of a perception project depends on automation, tooling, data conversion, evaluation scripts, and reproducibility just as much as on architecture choice.
3. Multi-sensor development can enable cheaper deployment
Using LiDAR and radar during development can help build stronger labels and better supervision, even when the final model is intended to operate primarily from camera input.
4. Pseudo-labeling is powerful but fragile
Pseudo-labels can scale dataset creation, but poor label quality quickly propagates into weak models. Confidence filtering, iterative self-training, and hybrid training regimes are critical.
5. Deployment should be considered early
Exportability, runtime compatibility, and integration constraints should be part of the design process, not an afterthought.
Conclusion
This semester project gave me end-to-end exposure to the lifecycle of a modern ML/DL system for autonomous driving perception. I worked across raw data collection, sensor calibration, multi-sensor fusion, data preparation, pseudo-label generation, monocular 3D object detection training, evaluation, and deployment-oriented export.
More importantly, it showed me how machine learning works in real systems: not as an isolated model, but as a tightly connected pipeline where data engineering, geometric reasoning, software tooling, experimentation, and deployment all matter.
That experience gave me a much deeper understanding of what it takes to move from a research idea to a practical perception module for real-world autonomous vehicles.
Short summary
This project explored the full lifecycle of a 3D object detection system for autonomous driving, from raw multi-sensor data collection to deployment-oriented model export. Using synchronized camera, LiDAR, and radar data from the EDGAR research vehicle, I built and automated a pipeline for calibration, data conversion, pseudo-label generation, monocular 3D detection training, evaluation, and ONNX-based deployment preparation. The work showed both the promise and limitations of pseudo-labeling in reducing annotation cost, while highlighting the importance of data quality, sensor alignment, reproducible pipelines, and system-level thinking in real-world ML engineering.