The paper presents a method for learning joint object detection and motion forecasting through the temporal fusion of lidar point clouds in the range view representation. While there have been several works that explore temporal fusion in the bird's-eye view representation, LaserFlow is one of the first works that accomplish temporal fusion in the range view. Let’s take a closer look at what temporal fusion of lidar point clouds actually means, why it is significant for developing joint detection and forecasting models, and how it informs some of the sensor fusion approaches we use at Aurora.
Lidar points are traditionally captured by rolling-shutter sensors, where points from a full 360-degree rotation are grouped together into a set of points, often referred to as a “sweep.” A single lidar sweep measures the position of objects at the instant it was captured, so it is imperative to use multiple sweeps to reason about the motion of objects. This process of fusing multiple sweeps of lidar data captured at different timestamps is referred to as “temporal fusion” and enables the AV to reason about the velocities and future positions of vehicles or people on the road.
Points from a lidar sweep can be encoded into various spatial representations, with the range view and the bird's-eye view encodings being the most common in the autonomous driving space. The range view discretizes the lidar points in the spherical coordinate system, whereas the bird's-eye view representation discretizes points in a Cartesian grid. Lidar points in both representations are depicted in Figures 1 and 2.
Figure 1. A range view representation of a sweep of lidar points. The size of objects in the range view varies with range, but the grid is dense, which makes it computationally efficient.
Figure 2. The same lidar points projected in the bird's-eye view. While objects appear their true size at all ranges in the bird's-eye view, the grid is sparse—a phenomenon that is exaggerated at longer ranges.
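To make the two encodings concrete, here is a minimal sketch of how a single lidar point could be assigned to a cell in each representation. The grid sizes, field-of-view limits, and function names are illustrative assumptions, not the values or code used in LaserFlow or at Aurora.

```python
import math

def range_view_cell(x, y, z, n_rows=32, n_cols=360, elev_min=-25.0, elev_max=15.0):
    """Map a sensor-frame point (meters) to a (row, col) range view cell and its range.

    The resolution and vertical field of view are hypothetical defaults.
    """
    rng = math.sqrt(x * x + y * y + z * z)
    if rng == 0.0:
        raise ValueError("point coincides with the sensor origin")
    azimuth = math.degrees(math.atan2(y, x))      # horizontal angle in [-180, 180]
    elevation = math.degrees(math.asin(z / rng))  # vertical angle in [-90, 90]
    col = int((azimuth + 180.0) / 360.0 * n_cols) % n_cols
    row = int((elev_max - elevation) / (elev_max - elev_min) * n_rows)
    row = min(max(row, 0), n_rows - 1)  # simplification: clamp instead of dropping
    return row, col, rng

def bev_cell(x, y, cell_size=0.5, extent=50.0):
    """Map a point to a (row, col) cell of a Cartesian grid covering
    [-extent, extent) meters on each axis; return None outside that region."""
    if not (-extent <= x < extent and -extent <= y < extent):
        return None
    return int((x + extent) / cell_size), int((y + extent) / cell_size)

# Two points straight ahead of the sensor, at 10 m and at 100 m.
near = range_view_cell(10.0, 0.0, 0.0)
far = range_view_cell(100.0, 0.0, 0.0)
near_bev = bev_cell(10.0, 0.0)
far_bev = bev_cell(100.0, 0.0)
```

Both points share the same range view cell and differ only in the stored range, while the bird's-eye view grid places the near point in its own cell and drops the far point because it lies outside the predefined 50 m region.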
Operating in the range view
Both representations have unique advantages that make them suitable for 3D object detection. Objects in the bird's-eye view representation appear the same size as they are in the physical world, regardless of their range. This representation is native to motion planning, as planning algorithms are mostly designed in Cartesian space. Additionally, transforming a set of lidar points from one timestamp to another in bird’s-eye view is equivalent to applying a rotation and a translation. However, bird’s-eye view representations suffer from sparsity (more empty cells/voxels at longer ranges), finite region-of-interest definitions (requiring a predefined grid), and grid-resolution tradeoffs (finer grids can capture smaller objects, but increase the computational complexity). These drawbacks hinder their use in long-range applications.
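The rotation-and-translation property can be sketched in a few lines. The relative pose between the two timestamps (yaw, tx, ty) is a made-up example value, and the helper names are hypothetical.

```python
import math

def transform_points_2d(points, yaw, tx, ty):
    """Rotate (x, y) points by yaw radians, then translate them by (tx, ty)."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]

def distance(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Two points on the same object, 2 m apart, seen in an older sweep.
old_points = [(10.0, 0.0), (10.0, 2.0)]

# Hypothetical pose change between sweeps: a 90-degree turn and a 5 m shift.
new_points = transform_points_2d(old_points, yaw=math.pi / 2, tx=-5.0, ty=0.0)

# A rigid transform preserves distances, so the object keeps its true size.
size_before = distance(old_points[0], old_points[1])
size_after = distance(new_points[0], new_points[1])
```

Because the transform is rigid, the object occupies the same metric footprint in the grid before and after, which is what makes aligning sweeps in the bird's-eye view straightforward.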
The range view representation, on the other hand, discretizes points in a spherical coordinate system, so representations are inherently dense from the sensor’s point of view, which makes them suitable for real-time perception algorithms. The compact representation is not constrained by a finite region of interest, making it ideal for long-range perception on highways where the AV approaches distant objects at high speeds. This dense representation is more computationally efficient (the dense input has a lower memory and compute footprint) and also enables the algorithm to implicitly learn occlusion-based information. However, unlike the bird’s-eye view, the size of objects in the range view varies with range. Further, temporal fusion in the range view is non-trivial (Figure 3), which makes it difficult to use for any sensor-based forecasting.
Figure 3. The above images depict a lidar sweep captured from a single lidar sensor from two different viewpoints. We show the lidar points associated with two objects in angular bins (top) as well as the range view array, which represents the rasterized image input to the model (bottom). The measurements for the two objects in the scene are shown in blue and green, respectively, and the occluded area behind them is shaded. The image on the left shows the sweep of the original viewpoint from which the data was captured, while on the right we show the same lidar points from a different viewpoint. In the original viewpoint, each angular bin contains at most one lidar point, thus preserving all the points captured by the sensor. On the right, the same lidar points from a different viewpoint are lost (depicted in red) due to occlusions from other objects, self-occlusions, or multiple points falling in the same bin. We also see that the two distinct objects now appear intermingled in the final range view array.
The challenging nature of range view temporal fusion makes LaserFlow a ground-breaking work in the field of joint object detection and forecasting, i.e., learning an object’s current geometrical shape, position, orientation, class, and future trajectory.
As shown in Figure 3, naively transforming temporal point clouds to a common range view frame leads to the loss of point information. Since a spherical projection is used, transforming points from any frame into a shared frame may cause a change in perspective. What’s more, a partially observed object will show up with “holes” in the point cloud, while objects that are occluded by other objects are lost completely. This is potentially problematic in real-time highway driving applications because viewpoint changes are even more exaggerated due to the AV’s high speed.
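The point loss from naive reprojection can be reproduced with a small 2D toy example that uses angular bins only, in the spirit of the top row of Figure 3. The scene (a guardrail 5 m to the side of the sensor), the 1-degree bin width, and the 10 m of forward motion are assumptions chosen for illustration. This is not the LaserFlow pipeline, only the naive baseline it avoids.

```python
import math

def azimuth_bin(dx, dy, n_bins=360):
    """Return the angular bin of a direction vector (1-degree bins by default)."""
    azimuth = math.degrees(math.atan2(dy, dx)) % 360.0
    return int(azimuth / 360.0 * n_bins) % n_bins

def rasterize(points, origin, n_bins=360):
    """Project (x, y) points into angular bins as seen from origin.

    Each bin keeps only its closest point, as in a range image."""
    closest = {}
    ox, oy = origin
    for x, y in points:
        dx, dy = x - ox, y - oy
        b = azimuth_bin(dx, dy, n_bins)
        r = math.hypot(dx, dy)
        if b not in closest or r < closest[b][0]:
            closest[b] = (r, (x, y))
    return closest

# One return per 1-degree bin along a guardrail at y = 5 m,
# for azimuths 30.5, 31.5, ..., 89.5 degrees from the original viewpoint.
guardrail = [(5.0 / math.tan(math.radians(30.5 + i)), 5.0) for i in range(60)]

native = rasterize(guardrail, origin=(0.0, 0.0))      # viewpoint of capture
reprojected = rasterize(guardrail, origin=(10.0, 0.0))  # after 10 m of forward motion

lost = len(guardrail) - len(reprojected)  # points dropped by bin collisions
```

In the native viewpoint all 60 points land in distinct bins. From the shifted viewpoint the same points span fewer than 60 bins, so several collide and only the closest one per bin survives.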
To overcome these challenges, we proposed a novel multi-sweep temporal fusion network for fusing learned temporal features, rather than raw points. We extracted spatial features in the native viewpoint of each sweep, instead of transforming the temporal sequence of lidar sweeps to a common frame of reference. This allowed the network to learn key features in the viewpoint in which the sweep was captured and minimized information loss. We also introduced a novel feature transformer network, along with point-level features that encode the AV’s motion. We transformed these individual features that were captured in the native viewpoint into the viewpoint of the most recent sweep. These spatio-temporal features were then processed by a multi-scale feature extractor backbone, where we finally generated 3D object detections along with trajectories.
Consequently, the work outlined in LaserFlow is the first range view-based method that achieves better detection and forecasting performance than state-of-the-art bird’s-eye view methods. The compact and efficient nature of the representation also makes this method scalable for longer ranges and real-time applications, such as highway driving. Qualitative results from the model can be seen in Figure 4.
Figure 4. Model predictions using LaserFlow (vehicles in orange, pedestrians in blue, bikes in red) along with the corresponding uncertainties depicted as ellipses. The ground truth is depicted in green. We observe that the model is able to estimate how the uncertainty of future positions increases at longer time horizons.
The future of forecasting
With LaserFlow, we were able to push the boundaries of using the range view temporally. We have since followed it up with several novel works that leverage better incremental fusion (RVFuseNet), multi-sensor fusion (LC-MV), and multi-view fusion (MVFuseNet), with the last being the current state of the art for joint object detection and motion forecasting.
However, joint detection-forecasting has its limitations. Because current joint detection and motion forecasting methods only aim to capture actor dynamics, they are better suited to short-term forecasts. As depicted in Figure 4, the predicted position uncertainty is much higher for long-term forecasts. These long-term forecasts do not account for the AV’s decisions, which influence how the other actors behave. This matters most in situations involving complex interactions, such as stop signs, unprotected left turns, and merges.
We discuss conditional forecasts in detail in Forecasting Part 1: Understanding Interaction, and then demonstrate the need for learned interleaved forecasting. Forecasting needs to incorporate actor-AV interactions in order to navigate key maneuvers for highway driving such as merging and lane-changing, and such forecasting models form the bedrock of decision-making at Aurora. Short-term forecasts from joint detection-forecasting can serve as a dynamics-based input to conditional forecasting models, thereby contributing to safe and efficient AV behavior.
Interested in working on state-of-the-art computer vision and machine learning models? Visit our Careers page.