A Brief Analysis of the Technology Route of Visual Perception for Autonomous Driving

2024.10.07

01 Background

Autonomous driving is gradually transitioning from the pre-research stage to the industrialization stage, a shift that shows up in four specific aspects. First, in the context of big data, dataset sizes are expanding rapidly, so many design details of prototypes developed on small-scale datasets are being filtered out, and only what works effectively on large-scale data will be retained. Second, the focus is switching from monocular to multi-camera setups, which increases complexity. Third, there is a tendency toward deployment-friendly design, such as moving the output space from image space to BEV space.

Finally, the community has gradually shifted from pursuing accuracy alone to also considering inference speed. Autonomous driving scenarios require fast responses, so performance requirements take speed into account, and more attention is paid to how to deploy models on edge devices.

Another part of the background is that, over the past ten years, visual perception has developed rapidly, driven by deep learning. A large body of work and some fairly mature paradigms exist in mainstream directions such as classification, detection, and segmentation. During its development, visual perception for autonomous driving has borrowed heavily from these mainstream directions, for example in target definition, feature encoding, perception paradigms, and supervision. Therefore, before devoting oneself to autonomous driving perception, these mainstream directions are worth studying first.

Against this backdrop, a large number of 3D object detection works targeting large-scale datasets have emerged in the past year, as shown in Figure 1 (the algorithms marked in red are those that have at some point ranked first).

Figure 1 Development of 3D object detection in the past year

02 Technical Route

2.1 Lifting

The main difference between visual perception in autonomous driving scenarios and mainstream visual perception lies in the space in which targets are defined. The targets of mainstream visual perception are defined in image space, while those of autonomous driving scenarios are defined in 3D space. When the input consists only of images, obtaining results in 3D space requires a lifting process, which is the core problem of autonomous driving visual perception.

Methods for solving the Lift problem can be grouped by where the transformation happens: at the input, at the intermediate features, or at the output. An example at the input level is view transformation: depth information is inferred from the image, the RGB values are then projected into three-dimensional space using that depth to obtain a colored point cloud, and existing point-cloud detection work is applied on top. A minimal sketch of this idea is given below.
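The sketch below illustrates the input-level idea under a pinhole camera model, using plain NumPy. The function name and the assumption of a dense per-pixel depth map are illustrative, not taken from any specific work.

```python
import numpy as np

def lift_to_colored_point_cloud(rgb, depth, K):
    """Back-project every pixel into 3D camera coordinates using a predicted
    depth map and the pinhole intrinsics K (3x3), keeping the RGB value of
    each pixel so the result is a colored point cloud."""
    H, W, _ = rgb.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))            # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)          # (H, W, 3) homogeneous pixels
    rays = pix.reshape(-1, 3) @ np.linalg.inv(K).T            # normalized camera rays
    xyz = rays * depth.reshape(-1, 1)                         # scale each ray by its depth
    return np.concatenate([xyz, rgb.reshape(-1, 3)], axis=1)  # (H*W, 6): x, y, z, r, g, b

# Toy usage: a 4x4 image with a constant depth of 10 m.
K = np.array([[100.0, 0, 2.0], [0, 100.0, 2.0], [0, 0, 1.0]])
cloud = lift_to_colored_point_cloud(np.zeros((4, 4, 3)), np.full((4, 4), 10.0), K)
print(cloud.shape)  # (16, 6)
```

The resulting colored point cloud can then be fed to an off-the-shelf point-cloud detector, which is the essence of the input-level route.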

Currently, the more promising approaches are feature-level transformations, or feature-level lifting, such as DETR3D, which performs the spatial transformation at the feature level. The advantage of feature-level transformation is that it avoids repeated extraction of image-level features, has low computational complexity, and sidesteps the problem of fusing per-camera results at the output level. Of course, feature-level transformation also has some typical problems, such as the use of non-standard operators (OPs), which makes deployment unfriendly.

At present, the more robust feature-level Lift processes are mainly based either on depth or on attention, with BEVDet and DETR3D being the most representative. The depth-based strategy estimates the depth of each point in the image and then projects the features into 3D space using the camera's imaging model, thereby completing the Lift. The attention-based strategy predefines objects in 3D space as queries, uses the intrinsic and extrinsic parameters to find the image features corresponding to each 3D point as keys and values, and then computes the feature of each 3D object through attention. A simplified sketch of this sampling step is given below.
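The following is a simplified, single-camera sketch of the attention-based feature gathering, in the spirit of DETR3D. The tensor layout, the `lidar2img` matrix convention, and the function name are assumptions for illustration; a real implementation handles multiple cameras, feature-map strides, and out-of-view points.

```python
import torch
import torch.nn.functional as F

def sample_image_features(ref_points, img_feats, lidar2img):
    """Project 3D reference points of the object queries into one camera view
    and bilinearly sample the image feature map at those locations.

    ref_points: (Q, 3) query reference points in the ego/LiDAR frame
    img_feats:  (C, H, W) image feature map of this camera
    lidar2img:  (4, 4) projection matrix (intrinsics * extrinsics) to feature-map pixels
    """
    Q = ref_points.shape[0]
    homo = torch.cat([ref_points, torch.ones(Q, 1)], dim=1)   # (Q, 4) homogeneous points
    cam = homo @ lidar2img.T                                  # (Q, 4) projected coordinates
    uv = cam[:, :2] / cam[:, 2:3].clamp(min=1e-5)             # perspective division
    _, H, W = img_feats.shape
    grid = torch.stack([uv[:, 0] / (W - 1) * 2 - 1,           # normalize to [-1, 1]
                        uv[:, 1] / (H - 1) * 2 - 1], dim=-1)  # (Q, 2)
    sampled = F.grid_sample(img_feats[None],                  # (1, C, H, W)
                            grid[None, :, None, :],           # (1, Q, 1, 2)
                            align_corners=True)               # (1, C, Q, 1)
    return sampled[0, :, :, 0].T                              # (Q, C): per-query key/value
```

The sampled features then serve as keys and values for the attention that updates each 3D query.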

Whether depth-based or attention-based, current algorithms are basically highly dependent on the camera model, which makes them sensitive to calibration and generally computationally complex. Algorithms that abandon the camera model often lack robustness, so this aspect is not yet fully mature.

2.2 Temporal

Temporal information can effectively improve object detection. For autonomous driving scenarios, temporal information has a deeper meaning, because the speed of a target is one of the main perception targets, and speed is fundamentally about change. A single frame does not carry sufficient change information, so modeling is needed to provide it along the time dimension. The existing point-cloud temporal modeling approach mixes multi-frame point clouds together as input; this yields a denser point cloud, making detection more accurate, and the multi-frame cloud also contains continuity information. During training, backpropagation then teaches the network how to extract this continuity information to solve tasks such as speed estimation that require it. A sketch of the multi-frame mixing step follows.
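Below is a small sketch of the multi-frame point-cloud mixing described above. The array layouts, the pose convention, and the use of a relative-timestamp channel are assumptions for illustration rather than the exact recipe of any specific dataset or method.

```python
import numpy as np

def concat_point_cloud_sweeps(sweeps, ego_poses):
    """Transform each past LiDAR sweep into the current ego frame and keep a
    relative-timestamp channel, so the network can recover motion cues itself.

    sweeps:    list of (N_i, 4) arrays [x, y, z, t_rel] in their own ego frame
    ego_poses: list of (4, 4) transforms from each sweep's ego frame to the
               current ego frame
    """
    merged = []
    for pts, T in zip(sweeps, ego_poses):
        xyz1 = np.concatenate([pts[:, :3], np.ones((len(pts), 1))], axis=1)
        xyz = (xyz1 @ T.T)[:, :3]                       # align to the current ego frame
        merged.append(np.concatenate([xyz, pts[:, 3:4]], axis=1))
    return np.concatenate(merged, axis=0)               # denser cloud + time channel
```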

The temporal modeling methods of visual perception mainly come from BEVDet4D and BEVFormer. BEVDet4D simply fuses the features of two frames to provide continuity information to the subsequent network. The other path is attention-based: the current frame and the previous (historical) frame both provide features, and the queries attend to these two sets of features simultaneously to extract temporal information.

2.3 Depth

One of the biggest drawbacks of autonomous driving visual perception compared to LiDAR perception is the accuracy of depth estimation. The paper "Probabilistic and Geometric Depth: Detecting Objects in Perspective" uses a GT-replacement method to study the impact of different factors on the performance score. The main conclusion of the analysis is that accurate depth estimation brings significant performance improvements.

However, depth estimation is a major bottleneck of current visual perception. There are two main ways to improve it: one is to refine the predicted depth map with geometric constraints, as in PGD; the other is to use LiDAR as supervision to obtain a more robust depth estimation.

The currently stronger solution along this line, BEVDepth, uses the depth information provided by LiDAR to supervise the depth estimation inside the view-transformation module during training, jointly with the main perception task.

2.4 Multi-modality/Multi-Task

Multi-task learning means completing multiple perception tasks within a unified framework; by sharing computation it can save resources or accelerate inference. However, current methods basically implement multi-tasking by simply attaching heads to features at different levels after obtaining a unified feature, and there is generally a performance drop after tasks are merged. Multi-modality work likewise generally looks for a representation in which the modalities can be directly combined, and then performs a simple fusion.

03 BEVDet Series

3.1 BEVDet

The BEVDet network is shown in Figure 2. The pipeline first extracts image-space features, converts them into BEV-space features, further encodes them into features suitable for prediction, and finally performs target prediction with a dense-prediction head.

Figure 2 BEVDet network structure

The view-transformation module works in two steps. First, assume the feature to be transformed has size V×C×H×W; a depth is predicted in image space in a classification manner, so that each pixel obtains a D-dimensional depth distribution. The outer product of the depth distribution and the image feature then renders a frustum of visual features, which the camera model projects into 3D space; the 3D space is voxelized, and a splat step produces the BEV feature. A minimal sketch of this lift-splat process is shown below.
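The sketch below shows the lift-splat computation under the assumption that the mapping from every frustum point to its BEV cell (`frustum_to_bev_idx`) has already been precomputed from the camera model; the names and shapes are illustrative rather than BEVDet's actual implementation.

```python
import torch

def lift_splat(img_feat, depth_logits, frustum_to_bev_idx, bev_hw):
    """Minimal sketch of an LSS/BEVDet-style view transformation.

    img_feat:           (V, C, H, W)  per-camera image features
    depth_logits:       (V, D, H, W)  per-pixel depth-classification logits
    frustum_to_bev_idx: (V*D*H*W,)    long tensor, flat BEV-cell index of each
                                      frustum point ordered as (view, depth, h, w);
                                      -1 means the point falls outside the BEV grid
    bev_hw:             (H_bev, W_bev)
    """
    V, C, H, W = img_feat.shape
    depth = depth_logits.softmax(dim=1)                          # depth distribution per pixel
    # outer product: each pixel feature is spread along its depth bins
    frustum = depth.unsqueeze(2) * img_feat.unsqueeze(1)         # (V, D, C, H, W)
    flat = frustum.permute(0, 1, 3, 4, 2).reshape(-1, C)         # (V*D*H*W, C)
    keep = frustum_to_bev_idx >= 0
    bev = torch.zeros(bev_hw[0] * bev_hw[1], C)
    bev.index_add_(0, frustum_to_bev_idx[keep], flat[keep])      # splat: sum-pool into BEV cells
    return bev.T.reshape(C, *bev_hw)                             # (C, H_bev, W_bev)
```

In practice this pooling is the part that dedicated kernels accelerate, since the scatter pattern is known ahead of time.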

A very important property of the view-transformation module is that it isolates data augmentation in image space from the BEV space. Specifically, a pixel can be projected through the camera intrinsics into a point in the camera coordinate system. When augmentation acts on the pixel in image space, an inverse transformation is applied so that the point's coordinates in the camera coordinate system remain unchanged before and after augmentation; this is what provides the isolation. The disadvantage of this isolation is that image-space augmentation does not regularize the learning in BEV space; the advantage is that it improves the robustness of BEV-space learning. The small example below illustrates the isolation property.
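This is a toy example of the isolation property, assuming the image augmentation can be written as a 3×3 matrix acting on pixel coordinates; the values and names are made up for illustration.

```python
import numpy as np

def pixel_to_camera_point(uv, depth, K, img_aug=np.eye(3)):
    """If the input image was augmented by a 3x3 matrix `img_aug` acting on
    pixel coordinates, its inverse is applied inside the view transformer,
    so the resulting point in the camera coordinate system is identical to
    the un-augmented one."""
    uv1 = np.array([uv[0], uv[1], 1.0])
    uv1 = np.linalg.inv(img_aug) @ uv1            # undo the image-space augmentation
    return depth * (np.linalg.inv(K) @ uv1)       # back-project with the camera model

K = np.array([[100.0, 0, 50.0], [0, 100.0, 50.0], [0, 0, 1.0]])
flip = np.array([[-1.0, 0, 99.0], [0, 1.0, 0], [0, 0, 1.0]])  # horizontal flip of a 100-px image
p_plain = pixel_to_camera_point((10.0, 20.0), 5.0, K)
p_aug = pixel_to_camera_point((flip @ [10.0, 20.0, 1.0])[:2], 5.0, K, img_aug=flip)
print(np.allclose(p_plain, p_aug))  # True: the camera-frame point is unchanged
```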

Several important conclusions can be drawn from the experiments. First, once the BEV-space encoder is used, the algorithm is more prone to overfitting. Another conclusion is that BEV-space augmentation has a greater impact on performance than image-space augmentation.

In addition, object sizes in BEV space are highly correlated with the category, and the overlap between objects in BEV space is small, which causes problems: the non-maximum suppression designed for image space is observed to be suboptimal here. As for acceleration, the core of the strategy is to assign independent threads to the different small computing tasks so that they run in parallel, which speeds up the computation without additional GPU-memory overhead. A sketch of a class-aware NMS variant in this spirit is given below.
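The following sketch shows a Scale-NMS-style, class-aware variant on axis-aligned BEV boxes (real detectors use rotated boxes); the scaling factors and helper names are illustrative assumptions rather than BEVDet's exact implementation.

```python
import numpy as np

def scale_nms(boxes, scores, classes, class_scale, iou_thr=0.5):
    """Enlarge each BEV box by a class-specific factor before computing IoU,
    so small objects with tiny native overlap can still suppress duplicates,
    while the output keeps the original (unscaled) boxes.

    boxes:       (N, 4) axis-aligned BEV boxes [x1, y1, x2, y2]
    class_scale: dict mapping class id -> scaling factor
    """
    def scale(b, s):
        cx, cy = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
        w, h = (b[2] - b[0]) * s, (b[3] - b[1]) * s
        return np.array([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])

    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter + 1e-9)

    scaled = np.stack([scale(b, class_scale[c]) for b, c in zip(boxes, classes)])
    order = np.argsort(-scores)
    keep = []
    while len(order):
        i, order = order[0], order[1:]
        keep.append(i)
        order = np.array([j for j in order if iou(scaled[i], scaled[j]) < iou_thr], dtype=int)
    return keep  # indices of the retained original boxes
```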

3.2 BEVDet4D

The BEVDet4D network structure is shown in Figure 3. The main question in this network is how to bring the features of the previous frame to bear on the current frame. We choose to retain the features output by the view-transformation module. We do not retain image features, because the target variables are defined in BEV space and image features are not suitable for direct temporal modeling; nor do we retain the features after the BEV Encoder, because we want the BEV Encoder itself to perform the temporal feature extraction.

Considering that the features output by the view-transformation module are relatively sparse, an extra BEV encoder is attached after the view transformation to extract preliminary BEV features before temporal modeling. For fusion, we simply align the previous frame's features with the current frame and concatenate them, completing the temporal fusion; in effect, the task of extracting temporal features is handed over to the subsequent BEV Encoder. A sketch of the align-and-concatenate step is shown below.
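Here is a sketch of the align-and-concatenate step, assuming the BEV feature map stores metric x along its width and metric y along its height, and that the ego transform between the two frames is known; names and conventions are illustrative.

```python
import torch
import torch.nn.functional as F

def align_and_concat(bev_prev, bev_curr, prev2curr, bev_range):
    """Warp the previous frame's BEV feature into the current ego frame using
    the ego-motion transform, then concatenate it with the current BEV feature.

    bev_prev, bev_curr: (1, C, H, W) BEV feature maps
    prev2curr:          (4, 4) ego transform from the previous to the current frame
    bev_range:          (x_min, x_max, y_min, y_max) metric extent of the BEV grid
    """
    _, C, H, W = bev_curr.shape
    x_min, x_max, y_min, y_max = bev_range
    # metric coordinates of every cell center of the *current* BEV grid
    xs = torch.linspace(x_min, x_max, W)
    ys = torch.linspace(y_min, y_max, H)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")                # (H, W)
    pts = torch.stack([gx, gy, torch.zeros_like(gx), torch.ones_like(gx)], -1)
    # move current-frame cell centers into the previous ego frame
    curr2prev = torch.linalg.inv(prev2curr)
    prev_pts = pts.reshape(-1, 4) @ curr2prev.T
    # normalize to [-1, 1] and sample the previous BEV feature at those positions
    u = (prev_pts[:, 0] - x_min) / (x_max - x_min) * 2 - 1
    v = (prev_pts[:, 1] - y_min) / (y_max - y_min) * 2 - 1
    grid = torch.stack([u, v], -1).reshape(1, H, W, 2)
    warped = F.grid_sample(bev_prev, grid, align_corners=True)
    return torch.cat([bev_curr, warped], dim=1)                   # (1, 2C, H, W)
```

After concatenation, the subsequent BEV Encoder is left to extract the actual temporal cues, as described above.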

Figure 3 BEVDet4D network structure

How should a learning target be designed to match this network structure? Before answering that, we need to understand some key characteristics of the network. The first is the receptive field of the features: because the network is trained by backpropagation, the receptive field of the features is determined by the output space.

The output space of an autonomous driving perception algorithm is generally defined as a region of a certain range around the ego vehicle. The feature map can be regarded as a uniformly distributed, corner-aligned discrete sampling of this continuous space. Since the receptive field of the feature map is defined within a certain range around the ego vehicle, it moves with the vehicle; therefore, at two different time points the receptive field of the feature map has a certain offset in the world coordinate system.

If the two features are concatenated directly, a static target sits at different positions in the two feature maps, and the offset of a dynamic target across the two feature maps equals the ego-motion offset plus the target's own offset in the world coordinate system. Following the principle of pattern consistency, since the target's offset in the concatenated features is entangled with the ego motion, the learning target of the network should be set as the change of the target's position between the two feature maps.

From the following formula, we can deduce that this learning target is unrelated to the motion of the ego vehicle and depends only on the target's movement in the world coordinate system.
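The formula itself is not reproduced in this text; a hedged reconstruction, with notation chosen to match the description above (positions expressed in the current ego frame, ego motion compensated by the alignment step), is:

```latex
% Hedged reconstruction; the symbols are assumptions, not copied from the source.
% P^{t}(O):       position of object O at time t, in the current ego frame
% P^{t-1}(O):     position of object O at time t-1, in the previous ego frame
% T_{t-1 \to t}:  ego-motion transform from the previous ego frame to the current one
\[
  \Delta\mathbf{P}(O) \;=\; \mathbf{P}^{t}(O) \;-\; \mathbf{T}_{t-1 \to t}\,\mathbf{P}^{t-1}(O)
\]
```

Both terms live in the same (current) ego frame, so the ego motion cancels and only the object's own displacement in the world remains; note that no division by a time interval appears.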

The difference between the learning target derived above and that of current mainstream methods is that the time component is removed. Speed equals displacement divided by time, but the two features provide no time-related cues, so if the network is asked to learn a speed target it must also implicitly estimate the time component, which increases the learning difficulty. In practice, we can fix the time between the two frames to a constant value during training, so that the network can absorb this constant time interval through backpropagation.

For temporal augmentation, we randomly use different time intervals during training. At different intervals the target's offset between the two frames differs, and so does the learned target offset, which makes the model robust to different offsets. At the same time, the model has a certain sensitivity to the target's offset: if the interval is too small, the change between the two frames is too small to be perceived. Therefore, choosing a suitable time interval at test time can effectively improve the model's generalization performance.

3.3 BEVDepth

This paper uses LiDAR to obtain a robust depth estimation, as shown in Figure 4. Point clouds are used to supervise the depth distribution in the view-transformation module. This supervision is sparse: it is denser than the depth supervision provided by object annotations, but it still does not give an accurate depth for every pixel. Nevertheless, it provides many more samples and thereby improves the generalization of the depth estimation. A sketch of this sparse supervision is given below.
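The following is a simplified sketch of the sparse depth supervision, assuming the LiDAR points have already been projected into the feature-map pixel grid; the use of plain cross-entropy over depth bins is an illustrative simplification, not necessarily BEVDepth's exact loss.

```python
import torch
import torch.nn.functional as F

def lidar_depth_loss(depth_logits, lidar_uvd, depth_bins):
    """Turn each projected LiDAR return into a one-hot depth-bin label and
    apply cross-entropy only at pixels that actually received a return.

    depth_logits: (D, H, W) predicted depth-classification logits for one view
    lidar_uvd:    (N, 3) LiDAR returns as [u, v, depth] in feature-map pixels
    depth_bins:   (D,) centers of the discretized depth bins
    """
    D, H, W = depth_logits.shape
    u = lidar_uvd[:, 0].long().clamp(0, W - 1)
    v = lidar_uvd[:, 1].long().clamp(0, H - 1)
    # label = index of the nearest depth bin for every LiDAR return
    labels = torch.argmin((lidar_uvd[:, 2:3] - depth_bins[None]).abs(), dim=1)
    logits = depth_logits[:, v, u].T                    # (N, D) logits at the hit pixels
    return F.cross_entropy(logits, labels)              # supervise only where LiDAR hits
```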

Figure 4 BEVDepth network structure

Another aspect of this work is to split the context feature and the depth into two branches for estimation, and to add an extra residual network in the depth branch to enlarge its receptive field. The researchers argue that errors in the camera's intrinsic and extrinsic parameters cause the context and the depth to be misaligned; when the receptive field of the depth-estimation network is not large enough, this costs some accuracy.

Finally, the camera intrinsics are fed into the depth-estimation branch as an additional input. A Squeeze-and-Excitation (SE)-like mechanism rescales the channels of the input feature, which effectively improves the network's robustness to different camera intrinsics. A sketch of such a camera-aware SE module follows.
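This is a minimal sketch of such a camera-aware, SE-style module; the MLP layout and the choice to feed only the flattened 3×3 intrinsics are assumptions for illustration, not BEVDepth's exact design.

```python
import torch
import torch.nn as nn

class CameraAwareSE(nn.Module):
    """Map the flattened camera intrinsics to per-channel gates that rescale
    the image feature before it enters the depth-estimation branch."""
    def __init__(self, channels, intrinsics_dim=9):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(intrinsics_dim, channels),
            nn.ReLU(inplace=True),
            nn.Linear(channels, channels),
            nn.Sigmoid(),                          # per-channel gate in (0, 1)
        )

    def forward(self, feat, intrinsics):
        # feat: (B, C, H, W); intrinsics: (B, 3, 3)
        gate = self.mlp(intrinsics.flatten(1))     # (B, C)
        return feat * gate[:, :, None, None]       # channel-wise rescaling

# Usage: rescale a feature map with each sample's camera intrinsics.
se = CameraAwareSE(channels=64)
out = se(torch.randn(2, 64, 16, 44), torch.eye(3).repeat(2, 1, 1))
```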

04 Limitations and Related Discussions

First of all, autonomous driving visual perception ultimately serves deployment, which involves both data and model issues. On the data side there are the issues of diversity and of labeling: manual labeling is very expensive, so we need to see whether automatic labeling can be achieved in the future.

At present, there is still no mature automatic annotation for dynamic targets; for static targets, 3D reconstruction can provide partially or semi-automatic annotation. On the model side, current model designs are not robust to calibration, that is, they are sensitive to it. How to make models robust to calibration, or independent of it, is a question worth thinking about.

Another issue is the acceleration of the network structure: can the view transformation be implemented with general-purpose operators (OPs)? The answer will affect how far network acceleration can go.