How Do Online High-Precision Maps Reshape the Autonomous Driving Industry? In-Depth Thinking from the Tsinghua Team

Foreword & the author's understanding:

Major manufacturers are now all planning light-map (lightweight HD map) or map-free (no HD map) solutions, a key component of which is the local map assisted by an SD map (standard definition map) as prior, so an overall grasp of this field is very practical. Local maps not only provide complex road-network details, but also serve as the basic input for key tasks such as vehicle positioning, navigation, and decision-making. Since standard definition maps (SD maps) are low-cost, easy to obtain, and highly versatile, perception methods that use SD maps as prior information show significant potential in the field of local map perception.

Today, the Heart of Autonomous Driving shares with you a comprehensive overview and summary of the latest progress in integrating SD maps as prior information into local map perception methods. First, the task definition and overall pipeline of local map perception with SD map priors, as well as the related public datasets, are introduced. Subsequently, the representation and encoding of multi-source information and the methods for multi-source information fusion are discussed in detail. In response to this rapidly developing trend, this article provides a comprehensive and detailed review of the diverse research work in this field. Finally, the article explores open issues and future challenges to help researchers understand the current trends and methods in this field.

Introduction

Local map perception is a critical and challenging task in the field of intelligent driving. It involves a detailed understanding and real-time modeling of the vehicle's surroundings as the basis for decision-making and navigation in autonomous driving systems. Local maps not only provide information about roads and lanes, but also cover the detection and recognition of obstacles, traffic signs, pedestrians, and other dynamic or static objects. This information is critical to ensuring safe vehicle operation and efficient path planning. Without accurate local map perception, autonomous vehicles may deviate from their routes, cause traffic accidents, or even endanger passenger safety. Therefore, local map perception plays an indispensable role in the autonomous driving ecosystem.

Unlike typical object detection, local map perception must maintain high accuracy under various lighting and weather conditions while processing complex and dynamic environmental information. For example, shadows on the road, light reflections, dynamic obstacles, and occlusions caused by traffic signs can all interfere with local map perception. In addition, sensor noise and data latency further increase the complexity of the perception task. Therefore, developing robust local map perception technology is crucial to achieving safe and reliable autonomous driving.

To address these issues, many researchers have proposed various methods. Chen and Lei proposed a method for visual positioning and map construction using ground texture, which improves positioning accuracy and map update accuracy through global and local optimization. Other researchers [2] enhanced online map prediction and lane topology understanding by utilizing SD maps, integrating SD map information through Transformer encoders, thereby alleviating the problem of lane-line occlusion or poor visibility and significantly improving lane detection and topology prediction performance. Researchers [3] proposed an innovative video lane detection algorithm that uses an occlusion-aware memory-based refinement (OMR) module to improve detection accuracy and robustness under occlusion using obstacle masks and memory information. RVLD improves the reliability of lane detection by recursively propagating the state of the current frame to the next and exploiting information from previous frames. In addition, methods such as LaneAF, LaneATT, and StreamMapNet also alleviate these problems.

In previous autonomous driving research, high-precision maps (HD maps) have always been crucial. HD maps have absolute and relative accuracy within 1 meter and provide high-precision, fresh, and rich electronic map information, including extensive road and environmental information. These maps provide accurate navigation and positioning services for safe and efficient autonomous driving. However, HD maps face major challenges, mainly in real-time updating and cost control. Urban road environments change frequently, and even minor changes may affect the driving safety of autonomous vehicles. Traditional HD map production methods require a great deal of time and resources, making real-time updates difficult to achieve; studies [8] and [9] have pointed out similar problems. In addition, the production and maintenance costs of HD maps are extremely high, reaching thousands of dollars per kilometer with traditional methods.

In this context, the "heavy perception, light map" approach has gained wide recognition in the industry. This approach emphasizes the use of on-board sensors for autonomous driving perception tasks, supplemented by lightweight map information. This strategy reduces reliance on real-time map updates and lowers maintenance costs. At the same time, lightweight map information can effectively compensate for certain limitations of on-board sensors and enhance the robustness of the model. As an electronic map widely used in traffic navigation and geographic information services, the standard definition map (SD map) has low production and maintenance costs, is easy to access, and has a small data volume, making it well suited as a lightweight map to assist on-board sensors in building local maps for autonomous driving.

Despite the promising prospects of building local maps based on SD Maps, there are many challenges and a lack of comprehensive research reviews in this area. To address this gap, this review aims to provide a comprehensive overview of the latest progress in local map building methods using SD Maps. Specifically, the focus is on SD Map information representation methods and the application of multimodal data fusion techniques in local map perception tasks. This study deeply explores the main developments, challenges, and research directions in this field. The existing literature on local map construction based on SD Maps as prior information is reviewed. The advantages and disadvantages of these methods are analyzed, providing insights into their effectiveness and applicability in real-time autonomous driving applications. The focus is on the representation and encoding methods of various sensor information, as well as the fusion technology of multi-source sensor data, which is crucial for real-time local map generation. The basic principles, architectures, and performance of these methods are discussed, revealing their feasibility and practicality in this field. In addition, this paper identifies the key challenges and open research issues for local map construction using SD Maps as prior information.

Background

This section will clarify the definition of local map construction based on SD maps and summarize the general process of such tasks. It will introduce the composition and application scenarios of SD maps. Finally, it will list the public datasets and evaluation metrics commonly used in local map perception tasks.

Local map construction task definition based on SD map

The local map perception task involves creating an accurate map that represents the environment around the vehicle to support autonomous driving decision-making and planning. This task usually relies on data from a variety of sensors, including cameras, lidar, radar, and GPS. In addition, incorporating prior information from the SD map can enhance the robustness of the model and reduce the impact of the uncertainty of the on-board sensors on the model, thereby improving the overall model performance. The core of the local map perception task lies in real-time perception and understanding of the environment around the vehicle.

The general process of the neural network for local map construction can be summarized into several key components, as shown in Figure 1. After inputting the surround view image and lidar point cloud, the overall architecture of the local map construction network can be divided into different parts: the backbone network for image feature extraction, the PV2BEV (perspective view to bird's eye view) module for view conversion, the module for multimodal feature fusion, and the task-specific heads for lane detection. These components constitute the basic framework of the local map perception network. The image and point cloud data captured by the surround view camera and lidar are first processed by the backbone network to obtain (multi-scale) image features. These features are then converted to a bird's eye view through the PV2BEV module, fused with the SD map data through the modal fusion module, and finally output through different specific task heads.
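To make this pipeline concrete, the following is a minimal PyTorch-style sketch of the architecture described above; all module names, arguments, and tensor shapes are illustrative assumptions rather than any specific paper's implementation.

```python
# Minimal PyTorch-style sketch of the local map construction pipeline in Figure 1.
# Module names, arguments, and shapes are illustrative placeholders.
import torch
import torch.nn as nn

class LocalMapNet(nn.Module):
    def __init__(self, backbone, pv2bev, fusion, heads):
        super().__init__()
        self.backbone = backbone            # per-camera image feature extractor (e.g. ResNet + FPN)
        self.pv2bev = pv2bev                # perspective-view -> bird's-eye-view transform
        self.fusion = fusion                # fuses BEV features with SD map / LiDAR priors
        self.heads = nn.ModuleDict(heads)   # task-specific heads (lane detection, topology, ...)

    def forward(self, images, sd_map_prior=None, lidar_bev=None):
        # images: (B, N_cams, 3, H, W) surround-view frames
        B, N, C, H, W = images.shape
        feats = self.backbone(images.flatten(0, 1))       # (B*N, C_f, H_f, W_f) PV features
        feats = feats.unflatten(0, (B, N))                 # restore the camera dimension
        bev = self.pv2bev(feats)                           # (B, C_bev, H_bev, W_bev)
        bev = self.fusion(bev, sd_map_prior, lidar_bev)    # multimodal fusion in BEV space
        return {name: head(bev) for name, head in self.heads.items()}
```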


Standard Definition Map

An SD map (standard definition map) is a digital map that provides basic geographic information and road network structure. It is widely used in daily navigation and geographic information services. An SD map mainly provides the centerline skeleton of the road network, but does not contain detailed lane information, road signs, or other high-precision environmental features.

For local map building tasks, SD maps offer three main advantages. First, SD map data is easy to obtain. It is usually available for free from open geographic data sources such as OpenStreetMap, which is suitable for large-scale applications. Second, compared with HD maps, the production and maintenance costs of SD maps are significantly reduced. Finally, SD maps are highly versatile, covering most types of roads, and can provide relevant road information for local map building tasks. Platforms such as OSM and Baidu Maps can serve as data sources for SD maps. For example, OpenStreetMap (OSM) is a collaborative project created and maintained by volunteers around the world, providing free, editable, open content maps. OSM data includes a wide range of geographic information, such as roads, buildings, parks, and rivers, which users can freely access, edit, and use.

Dataset

In the field of bird's-eye view (BEV) local map construction, commonly used datasets include KITTI, nuScenes, ApolloScape, Argoverse, OpenLane, and the Waymo Open Dataset.

The KITTI dataset, created by the Karlsruhe Institute of Technology and the Toyota Technological Institute at Chicago, provides stereo camera, lidar, and GPS/IMU data, covering urban, rural, and highway scenes, and is suitable for tasks such as object detection, tracking, and road detection. The nuScenes dataset, released by Motional, includes data from six cameras, five radars, one lidar, an IMU, and GPS, and covers urban traffic scenes under various weather and lighting conditions. The ApolloScape dataset, released by Baidu, provides high-precision 3D annotation data covering various urban road scenes, suitable for tasks such as lane detection and semantic segmentation.

The Argoverse dataset, released by Argo AI, includes stereo camera, lidar, GPS, and IMU data, provides detailed 3D annotations and lane markings, and is mainly used for 3D object detection and lane detection. The Waymo Open Dataset, released by Waymo, covers a variety of weather and traffic conditions, provides high-quality lidar and camera data, and is suitable for tasks such as 3D object detection, tracking, and lane detection.

OpenLane-V2 (also known as OpenLane-Huawei or Road Genome) is a benchmark dataset for road structure perception in next-generation autonomous driving scenarios, jointly open-sourced by Shanghai Artificial Intelligence Laboratory and Huawei Noah's Ark Laboratory. It is the first dataset that contains the topological relationship of road structures in traffic scenarios.

The ONCE-3DLanes dataset is a real-world autonomous driving dataset with 3D spatial lane layout annotations. It is a new benchmark dataset built to promote the development of monocular 3D lane detection methods. The dataset was collected in multiple geographical locations in China, including highways, bridges, tunnels, suburbs, and urban areas, covering different weather conditions (sunny/rainy) and lighting conditions (day/night). The entire dataset contains 211K images and their corresponding 3D lane annotations in the camera coordinate system.

CurveLanes is a new benchmark lane detection dataset containing 150,000 lane images for difficult scenarios such as curves and multiple lanes in traffic lane detection. The dataset was collected in real urban and highway scenes from multiple cities in China. All images are carefully selected, and most of them contain at least one curved lane. More challenging scenarios such as S-curves, Y-shaped lanes, nighttime, and multiple lanes can also be found in this dataset.

Common evaluation metrics

Evaluation Metrics for Lane Extraction

mAP (mean Average Precision) is a common metric for evaluating the performance of object detection models. It measures the accuracy of the model at different threshold levels by matching predicted bounding boxes with ground-truth boxes to count true positives (TP), false positives (FP), and false negatives (FN). First, predicted boxes are matched with ground-truth boxes according to a specified intersection-over-union (IoU) threshold. Then, the precision (TP / (TP + FP)) and recall (TP / (TP + FN)) of each category are calculated, and the precision-recall curve is plotted. The area under this curve, computed by interpolation, gives the average precision (AP) of a single category. Finally, the mean of the AP values over all categories is the mAP, which reflects the overall detection performance of the model; the higher the value, the better the performance.
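As a reference, the following is a minimal sketch of the single-class AP computation described above (all-point interpolation over the precision-recall curve); the matching of predictions to ground truth under an IoU threshold is assumed to have been done beforehand.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """Single-class AP under one IoU threshold.
    scores: confidence of each prediction; is_tp: 1 if matched to an unused GT box, else 0."""
    order = np.argsort(-np.asarray(scores))
    tp = np.asarray(is_tp, dtype=float)[order]
    fp = 1.0 - tp
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    recall = cum_tp / max(num_gt, 1)                          # TP / (TP + FN)
    precision = cum_tp / np.maximum(cum_tp + cum_fp, 1e-9)    # TP / (TP + FP)
    # interpolated precision: max precision at any recall >= current recall
    interp = np.maximum.accumulate(precision[::-1])[::-1]
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recall, interp):                          # area under the PR curve
        ap += (r - prev_r) * p
        prev_r = r
    return ap

# mAP = mean of per-class AP values, optionally averaged over several IoU thresholds.
```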

Mean Intersection over Union (mIoU) is a commonly used metric for evaluating the performance of semantic segmentation models. mIoU measures the model's pixel-level classification accuracy for the various classes. For each category, the IoU is calculated by dividing the number of pixels in the intersection of the predicted region and the ground-truth region by the number of pixels in their union. This is computed for every category, and the mean of the per-category IoU values is the mIoU, which provides an average evaluation of the model's segmentation accuracy; higher values indicate better segmentation performance.
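A minimal sketch of the mIoU computation, assuming the predictions and labels are integer class maps of the same shape:

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """pred, gt: integer label maps of identical shape."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                      # skip classes absent from both prediction and GT
            ious.append(inter / union)
    return float(np.mean(ious))
```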

Traditional object detection metrics, such as mAP, may not fully capture all important aspects of the detection task, such as the estimation of object velocity and attributes, as well as the accuracy of position, size, and orientation. Therefore, nuScenes Detection Score (NDS) is proposed to comprehensively consider these factors. NDS integrates multiple key metrics, overcomes the limitations of existing metrics, and provides a more comprehensive performance evaluation.

The calculation formula of NDS is as follows:

NDS = (1/10) · [5 · mAP + Σ_{mTP ∈ TP} (1 − min(1, mTP))]

In this formula, mAP stands for mean Average Precision, which measures detection accuracy. The TP set contains the means of five true-positive error metrics: ATE (average translation error), ASE (average scale error), AOE (average orientation error), AVE (average velocity error), and AAE (average attribute error); each error is clipped to at most 1 before being converted into a score.
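Under the definition above, NDS can be computed in a few lines; the dictionary keys below follow the nuScenes naming, and the example numbers are purely illustrative.

```python
def nds(map_score, tp_errors):
    """nuScenes Detection Score.
    tp_errors: dict with mATE, mASE, mAOE, mAVE, mAAE (lower is better)."""
    tp_terms = sum(1.0 - min(1.0, e) for e in tp_errors.values())
    return (5.0 * map_score + tp_terms) / 10.0

# Illustrative example:
# nds(0.45, {"mATE": 0.35, "mASE": 0.26, "mAOE": 0.42, "mAVE": 0.33, "mAAE": 0.19})
```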

Evaluation Metrics for Topological Reasoning

OpenLane-V2 divides the task into three subtasks: 3D lane detection, traffic element recognition, and topology reasoning. Overall task performance is described by the OpenLane-V2 Score (OLS), which is the average of the subtask metrics. The metric for 3D lane detection, DET_l, is the average of the AP values computed at several matching thresholds, where matching uses the Fréchet distance. Traffic element detection is similar to object detection and is evaluated using AP with an IoU threshold of 0.75. Traffic elements have multiple attributes, such as the color of traffic lights, which are closely related to lane passability, so attributes must also be considered; assuming A is the set of all attributes, the evaluation also includes attribute classification accuracy.

OpenLane-V2 uses the TOP score to evaluate the quality of topological reasoning, similar to the mAP metric, but adjusted to fit the structure of the graph. Basically, this transforms the topology prediction problem into a link prediction problem and evaluates the algorithm performance by calculating the average AP of all vertices. The first step is to determine a matching method to pair the real and predicted vertices (i.e., centerlines and traffic elements). For centerlines, the Fréchet distance is used; for traffic elements, the IoU is used. When the confidence of the edge between two vertices exceeds 0.5, they are considered to be connected. The vertex AP is obtained by sorting all the predicted edges of the vertex and calculating the average of the cumulative precision.
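The following is a simplified sketch of the per-vertex ranking procedure described above (not the official OpenLane-V2 implementation); it assumes the vertices have already been matched, and that a predicted edge counts as correct when it exists in the ground-truth graph.

```python
import numpy as np

def vertex_ap(edge_scores, edge_is_correct):
    """AP for one matched vertex: rank its predicted edges by confidence and
    average the cumulative precision at each correctly predicted edge."""
    order = np.argsort(-np.asarray(edge_scores))
    correct = np.asarray(edge_is_correct, dtype=float)[order]
    if correct.sum() == 0:
        return 0.0
    cum_correct = np.cumsum(correct)
    ranks = np.arange(1, len(correct) + 1)
    precisions = cum_correct / ranks
    return float((precisions * correct).sum() / correct.sum())

# TOP score is then (roughly) the mean of vertex_ap over all matched ground-truth vertices.
```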

Multimodal Representation

Image data

In the perception task under bird's eye view (BEV), the images from the surround-view cameras are the most important input data. Common feature extraction methods for surround-view images follow the BEVFormer or LSS paradigm for autonomous driving perception. The backbone module of the neural network extracts 2D image features from each camera view through classic and lightweight convolutional networks such as ResNet-50/101, MobileNet, EfficientNet, or VoVNet (V2-99). Among them, the ResNet series is widely used because its residual blocks alleviate the vanishing-gradient problem when training deep networks, and ResNet variants enhance feature extraction capability by increasing the depth and width of the network. Owing to their strong performance in image recognition and feature extraction, these networks are widely used in BEV local map perception tasks. Usually, a feature pyramid network (FPN) module is attached to the backbone; the FPN integrates feature maps of different scales to generate more powerful multi-scale feature representations and has become the de facto default configuration, with the number of fused levels chosen according to the network type. This multi-scale feature fusion helps improve the detection and recognition of objects of different sizes, thereby enhancing overall performance.
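For illustration, a minimal backbone-plus-FPN feature extractor might look like the sketch below; the layer choices, channel widths, and output level names are assumptions for demonstration, not the configuration of any particular map-perception model.

```python
# Minimal sketch: ResNet-50 backbone + FPN producing multi-scale PV image features.
import torch
import torch.nn as nn
import torchvision
from collections import OrderedDict
from torchvision.ops import FeaturePyramidNetwork

class ImageBackboneFPN(nn.Module):
    def __init__(self, out_channels=256):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)  # load pretrained weights in practice
        self.stem = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool)
        self.layer1, self.layer2 = resnet.layer1, resnet.layer2
        self.layer3, self.layer4 = resnet.layer3, resnet.layer4
        # C2..C5 channel widths of ResNet-50
        self.fpn = FeaturePyramidNetwork([256, 512, 1024, 2048], out_channels)

    def forward(self, x):                       # x: (B, 3, H, W) image from one camera
        c1 = self.stem(x)
        c2 = self.layer1(c1)
        c3 = self.layer2(c2)
        c4 = self.layer3(c3)
        c5 = self.layer4(c4)
        feats = OrderedDict([("p2", c2), ("p3", c3), ("p4", c4), ("p5", c5)])
        return self.fpn(feats)                  # dict of multi-scale PV features for PV2BEV
```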

In addition to these lightweight and simple backbone networks, larger-scale backbones are becoming a mainstream trend. With the success of the Transformer in computer vision, Transformer-based feature extractors such as the Swin Transformer have also been applied to BEV local map perception tasks. Referring to the methods on the nuScenes leaderboard, the most advanced methods all use a pre-trained ViT-L as the backbone, or its variant EVA-02. Although the large number of parameters and high computational complexity of such models can seriously affect inference speed, these large pre-trained backbones are key to improving model performance and directly drive gains in detection accuracy. Training these large models requires massive data, but data annotation is expensive and limited, so self-supervised training methods are likely to become mainstream. Following the success of BERT-style pre-training across self-supervised tasks in natural language processing, self-supervised learning has also flourished in computer vision: MAE randomly masks image patches and learns from the masked images, and masked image modeling (MIM)-based pre-training algorithms are booming in the field. Such self-supervised pre-trained models not only reduce the reliance on costly labels but also learn better image representations.

Whether based on CNN or Transformer methods, the ultimate goal is to obtain high-quality panoramic image feature representation. For BEV local map perception tasks, feature representation is crucial because it directly affects the accuracy and robustness of the perception system. The global feature extraction mechanism of FPN modules or Transformer can significantly improve the overall performance of the network, making it more effective in perception and decision-making in complex driving environments.

LiDAR point cloud data

In the BEV local map perception task, besides using pure-vision surround cameras as a single data input, multimodal methods also fuse information such as lidar point clouds and camera data to perform depth-aware BEV transformation. Compared with single-vision methods, multimodal (RGB + LiDAR) fusion methods excel in accuracy despite the additional computational complexity. Processing lidar point cloud data is a key step in the multimodal perception task. In P-MapNet, feature extraction from the lidar point cloud first voxelizes the point cloud, and then a multi-layer perceptron (MLP) extracts local features for each point. Max pooling then takes the maximum over the per-point local features to form a global feature representation, enhancing the model's global perception of the point cloud data.
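A PointNet-style sketch of the per-point MLP and max-pooling step described above is given below; the input layout (points already grouped into voxels) and the feature dimensions are simplifying assumptions.

```python
# Sketch of the per-point MLP + max-pooling encoder for voxelized lidar points.
import torch
import torch.nn as nn

class PointFeatureEncoder(nn.Module):
    def __init__(self, in_dim=4, feat_dim=64):        # e.g. x, y, z, intensity per point
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
        )

    def forward(self, voxel_points):
        # voxel_points: (num_voxels, max_points_per_voxel, in_dim)
        point_feats = self.mlp(voxel_points)           # per-point local features
        voxel_feats, _ = point_feats.max(dim=1)        # max pooling -> one feature per voxel
        return voxel_feats                             # scatter onto the BEV grid afterwards
```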

Given the lidar point cloud P and the surround-view images I, the formula is as follows:

B = F(P, I), Y = D(B)

where F denotes the feature extractor, which processes the multimodal input to obtain the BEV features B, and D denotes the decoder, which outputs the detection results Y.

The MapLite 2.0 method further integrates lidar point cloud data with other sensor data and with rough road maps obtained from SD maps such as OpenStreetMap, using the rough route information in the SD map to optimize the geometry and topology of the road. This not only improves the accuracy of the map, but also enhances the understanding of complex road environments. It is also used to generate high-definition maps online by projecting lidar intensity data from a bird's-eye view. By integrating multimodal data, not only detailed spatial information is provided, but also accurate semantic segmentation of the driving environment is achieved.

SD map data

In the context of enhancing local map perception tasks, integrating SD map information as prior knowledge can significantly improve the performance of vision and lidar sensors, especially in long-range and occluded scenarios. In order to effectively integrate SD maps into the network structure while preserving its unique road information, various representation forms have been explored. SD maps can generally be divided into two forms: raster and vector.

Figure 2 shows an example of an SD map, illustrating how different forms of SD map representations can be utilized to supplement the local map building process and thus enhance the overall performance of the perception system.

When the SD map prior is included, the feature extractor takes the multimodal data together with the prior:

B = F(P, I, S), Y = D(B)

where S is the SD map prior knowledge in the form of the road centerline skeleton, F denotes the feature extractor, which processes the multimodal input to obtain the BEV features B, and D denotes the decoder, which outputs the detection result Y.

Raster representation

MapLite 2.0 first introduced SD maps into the local map perception task. PriorLane models the map as a binary image, where 1 represents a drivable area and 0 represents an undrivable area. Similarly, MapVision also uses a one-hot encoding, then concatenates position encoding information and extracts SD map features through an encoder; the SD map is aligned with the ego-vehicle data through the proposed KEA module and then fused with the sensor data to obtain a hybrid representation. Both P-MapNet and MapLite 2.0 rasterize the SD map, but they differ in how they use it: P-MapNet applies a CNN to the rasterized SD map to extract information that serves as an additional source (i.e., keys and values) for BEV feature refinement, whereas MapLite 2.0 uses the SD map as the initial estimate of the HD map, converts it to the BEV perspective, and combines it with the sensor image input. A convolutional neural network is trained to predict semantic labels; these semantic segmentation results are then converted into a distance transform for each label, and a structured estimator maintains the local map estimate while integrating the SD map prior knowledge.
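As an illustration of the raster representation, the sketch below draws SD map centerline polylines into a binary BEV mask; it assumes the polylines have already been transformed into ego/BEV coordinates, and the grid parameters are placeholders.

```python
# Illustrative rasterization of SD map centerline polylines into a binary BEV mask.
import numpy as np
import cv2

def rasterize_sd_map(polylines, bev_h, bev_w, bev_range, line_width=1):
    """polylines: list of (K_i, 2) arrays of (x, y) in metres; bev_range: (x_min, x_max, y_min, y_max)."""
    mask = np.zeros((bev_h, bev_w), dtype=np.uint8)
    x_min, x_max, y_min, y_max = bev_range
    for line in polylines:
        # map metric coordinates to pixel indices on the BEV grid
        u = (line[:, 0] - x_min) / (x_max - x_min) * (bev_w - 1)
        v = (line[:, 1] - y_min) / (y_max - y_min) * (bev_h - 1)
        pts = np.stack([u, v], axis=1).round().astype(np.int32).reshape(-1, 1, 2)
        cv2.polylines(mask, [pts], isClosed=False, color=1, thickness=line_width)
    return mask   # feed this through a small CNN to obtain raster SD map prior features
```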

Vector representation

SMERF first proposed a Transformer-based encoder model for inferring road topology. MapEX and SMERF have similar representations for map elements, introducing a polyline sequence representation and a Transformer encoder to obtain the final map representation of the scene. Specifically, the roads in the SD map are first abstracted as polylines. Each polyline is uniformly sampled into N data points, which after sine-cosine encoding yield an N-dimensional line description. Consider, for example, a nearly vertical line with small curvature: all of its points have very similar x (or y) values, so directly feeding the raw coordinates into the model may leave the curvature insufficiently distinguishable.

Therefore, using sinusoidal embeddings makes this difference more pronounced, improving the model's ability to interpret these features. In practice, the coordinates of each line are normalized to the range (0, 2π) relative to the BEV range and then embedded. The encoded data pass through several Transformer layers to obtain the map feature representation.
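A small sketch of this polyline representation is shown below: uniform arc-length resampling to N points, followed by sine-cosine embedding of coordinates normalized to (0, 2π). The number of sample points and frequencies are illustrative assumptions.

```python
# Sketch: uniform polyline resampling + sinusoidal coordinate embedding.
import numpy as np

def resample_polyline(points, n_samples=11):
    """points: (K, 2) ordered vertices; returns (n_samples, 2) evenly spaced along arc length."""
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)
    cum = np.concatenate([[0.0], np.cumsum(seg)])
    targets = np.linspace(0.0, cum[-1], n_samples)
    x = np.interp(targets, cum, points[:, 0])
    y = np.interp(targets, cum, points[:, 1])
    return np.stack([x, y], axis=1)

def sinusoidal_embed(coords, bev_range, num_freqs=8):
    """coords: (N, 2) in metres; normalise to (0, 2*pi) w.r.t. the BEV range, then embed."""
    x_min, x_max, y_min, y_max = bev_range
    norm = np.stack([(coords[:, 0] - x_min) / (x_max - x_min),
                     (coords[:, 1] - y_min) / (y_max - y_min)], axis=1) * 2 * np.pi
    freqs = 2.0 ** np.arange(num_freqs)                      # (F,)
    angles = norm[:, :, None] * freqs[None, None, :]         # (N, 2, F)
    emb = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return emb.reshape(len(coords), -1)                      # (N, d) line description, d = 4*F
```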

Coding of other information

SMERF: in addition to encoding the polyline coordinates of the SD map, SMERF uses one-hot encoding to represent the road type as a vector of dimension K (the number of road types). For the M map elements within the perception range, this yields an M × (N·d + K) encoding, which is transformed through several layers to obtain the map feature representation. Ablation experiments show that adding road type information improves both lane detection and road topology inference.
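For completeness, appending the one-hot road-type vector to a flattened polyline embedding can be sketched as follows; the road-type category list is a made-up example, not the set used by SMERF.

```python
# Sketch: concatenate a one-hot road-type vector onto one polyline's coordinate embedding,
# giving a (N*d + K)-dimensional description per map element.
import numpy as np

ROAD_TYPES = ["residential", "service", "primary", "secondary", "tertiary"]  # K = 5, example set

def encode_polyline_with_type(coord_embedding, road_type):
    """coord_embedding: (N, d) or flattened (N*d,) sinusoidal embedding of one polyline."""
    one_hot = np.zeros(len(ROAD_TYPES), dtype=np.asarray(coord_embedding).dtype)
    one_hot[ROAD_TYPES.index(road_type)] = 1.0
    return np.concatenate([np.asarray(coord_embedding).reshape(-1), one_hot])  # (N*d + K,)
```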

Multimodal Fusion Method

Among the methods that take images as input, MapTR, built on an encoder-decoder architecture, established a classic paradigm for local map construction and paved the way for subsequent methods. StreamMapNet further enhances performance in occluded areas by integrating comprehensive temporal information. 3D-LaneNet adopts an end-to-end learning framework that integrates image encoding, spatial transformation between views, and 3D curve extraction into a single network. Gen-LaneNet proposes a two-stage framework that decouples the learning of the image segmentation sub-network and the geometric encoding sub-network. In addition, many monocular 3D lane detection methods and other models rely only on visual images as input. On the other hand, HDMapNet, as a representative multimodal method, achieves effective fusion of multi-sensor data by encoding lidar point clouds and predicting vectorized map elements from a bird's-eye view; other models also use lidar point cloud data as additional input. Figure 3 shows the development trend of local map construction in recent years. Considering the cost of building high-precision maps, MapLite 2.0 took the lead in introducing SD maps into local map perception tasks. MapEX addresses incomplete or inaccurate existing map information by converting existing map elements into non-learnable queries and combining them with learnable queries for training and prediction. SMERF and P-MapNet combine the feature representation of SD maps with camera input features and use a multi-head cross-attention mechanism to make lane topology inference more effective.


In order to achieve effective fusion of visual BEV features and SD map semantic information, BLOS-BEV explored various feature fusion methods. In addition, methods such as PriorLane, FlexMap, Bayesian, TopoLogic, LGMap, MapVision, RoadPainter, and EORN integrate SD map priors into local map construction, and this trend is gradually gaining attention. Before fusion, view conversion is required. This section focuses on converting feature information extracted from 2D camera images, commonly referred to as perspective views (PV), into BEV features. Local map perception tasks usually treat the ground as a plane and build maps in a bird's-eye view, because BEV facilitates the fusion of multi-sensor information and existing advanced BEV object detection work provides a good foundation. The conversion methods from PV to BEV include geometry-based methods and network-based methods. Geometry-based methods can be divided into homography (IPM)-based transformation and depth estimation; network-based methods can be divided into MLP-based and Transformer-based methods. Transformer-based PV-to-BEV conversion can usually be implemented directly through a BEV perception model. As shown in Figure 4, MapTR adopts an optimized GKT module built on the view transformer module in BEVFormer.
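As a concrete example of the geometry-based route, the sketch below performs inverse perspective mapping (IPM) onto the ground plane using a homography; it assumes calibrated intrinsics and extrinsics and a flat ground, and the BEV grid parameters are placeholders.

```python
# Geometry-based PV -> BEV sketch: inverse perspective mapping under a flat-ground assumption.
import numpy as np
import cv2

def ipm_warp(image, K, R, t, bev_h, bev_w, bev_range):
    """Warp a PV image onto the z = 0 ground plane.
    K: 3x3 intrinsics; R, t: rotation/translation taking ground-plane coords into the camera frame."""
    x_min, x_max, y_min, y_max = bev_range
    # homography from ground-plane point (X, Y, 1) to image pixel: H = K [r1 r2 t]
    H_ground_to_img = K @ np.column_stack([R[:, 0], R[:, 1], np.asarray(t)])
    # map BEV pixel (u, v, 1) -> metric ground coordinates (X, Y, 1)
    S = np.array([[(x_max - x_min) / bev_w, 0.0, x_min],
                  [0.0, (y_max - y_min) / bev_h, y_min],
                  [0.0, 0.0, 1.0]])
    H_bev_to_img = H_ground_to_img @ S
    # warpPerspective samples src at M^-1 * (x, y), so pass the inverse mapping
    return cv2.warpPerspective(image, np.linalg.inv(H_bev_to_img), (bev_w, bev_h),
                               flags=cv2.INTER_LINEAR)
```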


Alignment

Due to the inherent errors of GPS signals and the influence of vehicle motion, the vectorized or rasterized SD map prior is inevitably spatially misaligned with the current BEV space, making full alignment difficult. Therefore, before fusion, the SD map prior must be spatially aligned with the current BEV operating space. FlexMap uses SLAM trajectories and corrected RTK trajectories to calculate offsets and achieve spatial alignment. To address this problem, PriorLane introduces a KEA (knowledge embedding alignment) module that embeds the SD map prior knowledge and spatially aligns it with image features. Specifically, a feature extraction network is first used to extract feature points from the image, and feature points are likewise extracted from the SD map prior knowledge.

Subsequently, these feature points are spatially matched using an alignment algorithm based on an attention mechanism. Finally, the aligned feature points are further processed by a fusion Transformer network, which enhances the accuracy and robustness of the local map perception algorithm. Similarly, P-MapNet first downsamples the rasterized SD map prior and then introduces a multi-head criss-cross attention module, which lets the network use criss-cross attention to determine the most appropriate alignment position, thereby effectively enhancing the BEV features with the SD map prior. As shown in Figure 5, the ablation experiments of P-MapNet show that directly concatenating the SD map prior information can still improve model performance even when it is only weakly aligned with the BEV space. On this basis, adding a CNN module and a multi-head criss-cross attention module further improves performance. This demonstrates the important role of SD map priors in the local map perception task: simply adding a rasterized SD map prior improves model performance even without strict alignment.

Fusion

After obtaining the feature representation of multi-sensor data, fusion processing is required to obtain a stronger feature representation.

In order to align features from different sensors, fusion is performed at the BEV level. The image BEV features are obtained from the surround-view images through the view-conversion module. In SMERF, the SD map features interact with the BEV features through a cross-attention mechanism: the BEV features are encoded into query vectors and initialized through self-attention. Given the SD map of the scene, LGMap uniformly samples a fixed number of points along each polyline, as shown in Figure 6. After sinusoidal embedding, cross-attention is applied between the SD map feature representation and the features from the visual input at each BEVFormer encoder layer: the SD map features are encoded into key and value vectors, and the final fused BEV features of the camera and SD map are obtained through the cross-attention computation.
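A minimal sketch of this kind of cross-attention fusion is given below: BEV features act as queries, and encoded SD map tokens act as keys and values. The layer arrangement (one self-attention block followed by one cross-attention block with residual connections) is a simplification, not the exact SMERF architecture.

```python
# Sketch of SD map / BEV cross-attention fusion: BEV queries attend to SD map key/value tokens.
import torch
import torch.nn as nn

class SDMapCrossAttentionFusion(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, bev_feats, sd_tokens):
        # bev_feats: (B, C, H, W) -> (B, H*W, C) query tokens; sd_tokens: (B, M, C)
        B, C, H, W = bev_feats.shape
        q = bev_feats.flatten(2).transpose(1, 2)
        q = self.norm1(q + self.self_attn(q, q, q)[0])                    # initialise BEV queries
        q = self.norm2(q + self.cross_attn(q, sd_tokens, sd_tokens)[0])   # SD map as key/value
        return q.transpose(1, 2).reshape(B, C, H, W)                      # fused BEV features
```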


In addition to the common attention-based fusion, BLOS-BEV, as shown in Figure 7, explores different fusion schemes to combine visual BEV features with SD map semantics for optimal representation and performance, examining three SD map fusion techniques: addition, concatenation, and cross-attention. Although all fusion methods outperform the variant without SD maps, cross-attention fusion of SD maps performs best on the nuScenes and Argoverse datasets, showing excellent generalization and outstanding performance at long range (150-200 meters).
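The three fusion schemes compared by BLOS-BEV can be sketched as follows; the projection and attention modules passed in are placeholders, and the exact layer choices in the original work may differ.

```python
# Sketch of three BEV / SD map fusion schemes: addition, concatenation, cross-attention.
import torch

def fuse(bev, sd_bev, mode, proj=None, attn=None):
    """bev, sd_bev: (B, C, H, W) visual BEV features and SD map semantic features.
    proj: e.g. nn.Conv2d(2*C, C, 1); attn: e.g. nn.MultiheadAttention(C, 8, batch_first=True)."""
    if mode == "add":
        return bev + sd_bev
    if mode == "concat":                       # channel concat followed by a 1x1 projection
        return proj(torch.cat([bev, sd_bev], dim=1))
    if mode == "cross_attention":              # BEV queries attend to SD map tokens
        B, C, H, W = bev.shape
        q = bev.flatten(2).transpose(1, 2)
        kv = sd_bev.flatten(2).transpose(1, 2)
        out, _ = attn(q, kv, kv)
        return out.transpose(1, 2).reshape(B, C, H, W)
    raise ValueError(mode)
```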


In P-MapNet, point cloud information is also added: the lidar point cloud is voxelized and processed by an MLP to obtain a feature representation for each point, yielding a lidar BEV. The image BEV and lidar BEV are fused to obtain further fused BEV features, and convolutional downsampling of the fused BEV features alleviates the misalignment between the image BEV features and the lidar BEV features.

Through the cross-attention mechanism, the SD map features interact with the fused BEV features, and the BEV features of the camera, lidar point cloud, and SD map are finally fused. Similarly, MapVision and MapEX use SD map features as keys and values, and the feature maps formed from multi-view images as queries, to perform cross-attention, as shown in Figures 8 and 9.


To address the inaccuracies that may result from occlusion and limited perception range, RoadPainter proposes a novel SD map interaction module, which, as Figure 10 shows, effectively enhances BEV features by incorporating information beyond the visual range. EORN, as shown in Figure 11, rasterizes the SD map and generates an SD map in BEV; an SD encoder based on ResNet-18 extracts SD map features, which are then interpolated and concatenated along the channel dimension with the BEV features from the image branch. The fusion uses a simple two-layer convolutional network, ConvFuser, which fuses the concatenated features and outputs the fused BEV features. Another approach involves a graph encoder that fuses the SD map graph with the BEV features and combines the output with a centerline deformable decoder using a multi-head attention mechanism. The subsequent decoder can then query, compute, and output results for different tasks from the information-rich BEV features.
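A ConvFuser-style module as described for EORN might be sketched as follows; the channel sizes, kernel sizes, and interpolation mode are assumptions for illustration.

```python
# Sketch of a ConvFuser-style module: resize SD map BEV features, concatenate along channels,
# and fuse with a two-layer convolutional network.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvFuser(nn.Module):
    def __init__(self, bev_ch=256, sd_ch=64, out_ch=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(bev_ch + sd_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(),
        )

    def forward(self, bev_feats, sd_feats):
        # interpolate SD map features to match the visual BEV grid, then channel-wise concatenate
        sd_feats = F.interpolate(sd_feats, size=bev_feats.shape[-2:], mode="bilinear",
                                 align_corners=False)
        return self.fuse(torch.cat([bev_feats, sd_feats], dim=1))
```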


Conclusion and discussion

Challenges and Future Prospects

  1. Improvement of SD map encoding and processing methods Appropriate encoding and processing methods are crucial to exploit SD map prior information in local map perception tasks. Current studies have adopted relatively simple encoding and processing methods, whether using raster or vector representation. Future studies can explore more efficient encoding and feature extraction methods.
  2. Improvement of the alignment between SD map prior information and BEV space Due to the accuracy limitation of GPS sensors, it is challenging to perfectly align the SD map prior information with the current BEV operation space. This spatial misalignment may affect the detection accuracy of the model to some extent. Improving the spatial alignment method can further improve the model performance. Future research can consider integrating temporal information to improve the alignment accuracy between the SD map prior information and the BEV space.
  3. Inference of road topological relations The topological relations in local maps can be divided into two branches: the topological relations between roads (mainly representing road connectivity) and the topological relations between roads and traffic signs (including traffic control signals and other directional signs). Enhancing scene understanding of the road environment is crucial for advanced autonomous driving tasks. The OpenLane-v2 dataset is the first public dataset that provides topological relations between roads and between roads and traffic signs. Currently, research in this area is still limited. Future work can use graph neural network models to model the topological structure of road networks and scene understanding tasks of traffic signs.
  4. Integrate more SD map prior information Existing studies have shown that incorporating more road type information can enhance model performance. However, in addition to the basic road network location and road type, SD maps can also provide richer prior information. For example, OpenStreetMap provides additional information such as the number of lanes, lane directions, and road topology relationships. Future research can try to integrate this diverse information into SD map priors to further enhance the robustness and accuracy of local map perception models.

Summary

This paper reviews the literature on local mapping using SD maps, highlighting the key role of SD maps in this task. The definition and core aspects of local mapping using SD maps are introduced, showing its importance in developing accurate and reliable maps. Commonly used public datasets and their corresponding evaluation metrics are listed.

The main processes of leading technical methods are summarized, focusing on the representation and encoding methods of data from different sensors (such as lidar, camera and radar). The advanced fusion techniques for multi-source sensor data integration and their respective advantages and disadvantages are explored.

The development prospects and design trends of local map building models are discussed, including emerging challenges such as improving the alignment of SD maps with the BEV perspective and enhancing encoding and processing methods. The potential of incorporating richer SD map prior information and of modeling road topology relationships is considered, aiming to improve scene understanding and support advanced autonomous driving tasks.