DAGW: Exploration and Practice of Data Aggregation Gateway

2023.07.04



This article analyzes how the video details page, the most visited page on Bilibili, is implemented, and the problems caused by its ever-growing fanout reads. It proposes building a business association index, which reduces the load on downstream services by more than 90%. It also proposes and implements a general data aggregation gateway (DAGW) for other aggregated display scenarios.

Business background

Bilibili is a video community built mainly on PUGV. Users spend most of their time watching videos on the video details page. As the business grows, more and more extended services appear on this "main battlefield", for example topics, video honors, notes, and user decorations.


(Figure 1: All traffic will be aggregated to the video details page)

As Figure 1 shows, the functional pages in the APP fall into two categories. The first is list pages (List View Page), such as recommendation, search, dynamics, and partitions; most pages are of this type, and they offer users rich content filtering and preview scenarios. The second is the detail page (Detail View Page): when users tap content they are interested in on any list page, they are taken to the detail page to view it.


(Figure 2: The video details page gathers a variety of information and function entries associated with the video)

As Figure 2 shows, the video details page gathers the attributes and function entries related to the video, such as manuscript honors (popular, site-wide rankings, weekly must-see, and others), video shooting templates, video collections, video soundtracks, and related topics. This information and these entries help users further explore related topics and functions.

Current situation and problems

In terms of technical implementation, the user-facing application architecture of Bilibili is divided into four layers:

  • Terminal layer: Clients that interact directly with users, including the mobile APP, H5, Web and the PC client, as well as other screen terminals such as TV, car, speakers, and PS.
  • Access gateway: Generally LB (Load Balancer) plus AGW (API Gateway). AGW is mainly responsible for request routing, protocol conversion, protocol offloading, rate limiting and circuit breaking, security blocking, and so on.
  • BFF (Backend for Frontend): As the number of terminals grows, applications are usually split by terminal so that client-specific logic stays isolated, for example web-interface (for Web pages), app-interface (for the APP), and tv-interface (for TV). In addition, as page logic becomes more complex and traffic increases, the BFF logic of individual pages is also split into separate applications so they can be released and deployed in isolation, for example app-feed (home page) and app-view (video details page).
  • Business Service: Provides the interfaces of a business domain or capability, and is usually split by function/capability and business domain.


(Figure 3: Layered application architecture)

As can be seen from Figure 3, the main logic of the video details page is concentrated on the BFF layer. With the growth of DAU and the continuous expansion of business, we are faced with two problems:

Problem 1: The number of fanout reads grows as the business expands, which brings heavy traffic load and complexity to the BFF itself and to downstream services. As shown in the figure below, to display its function entry on associated videos, a business service has to carry the traffic and CPU cost of every video details request. It also has to implement something like a Bloom filter to avoid the large number of back-to-source queries caused by requests for videos that have no association with it.

(Figure 4: As BFF fanout reads grow, load is amplified indiscriminately to all services and makes their implementation more complex)

Problem 2: Problem 1 might be handled by adding machines and accepting more implementation complexity, but as the number of fanout reads keeps increasing, the latency of a single video details request keeps deteriorating until it becomes unacceptable to users. Figure 4.a [Reference 1] shows that a larger fanout greatly increases the probability that the overall request times out. Figure 4.b shows the fanout request topology of the real Bilibili APP video details BFF; it is already so large that the picture can no longer be read clearly, and the fanout keeps growing with the business.

(Figure 4.a: The correlation between the number of fanouts and the timeout rate, excerpted from "The Tail at Scale")

(Figure 4.b: The actual BFF fanout of the video details page, drawn by the internal Trace system)
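
The relationship between fanout and timeout rate can be reproduced with a short calculation. The sketch below assumes that every downstream call independently exceeds the timeout with the same probability, which is a simplification (real services share load and failure modes); the function name is our own and does not come from the article.

```python
def fanout_timeout_probability(p_single: float, fanout: int) -> float:
    """Probability that at least one of `fanout` parallel calls is slow.

    Assumes each call independently exceeds the timeout with
    probability `p_single` (a simplifying assumption).
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be within [0, 1]")
    if fanout < 0:
        raise ValueError("fanout must be non-negative")
    # The request is fast only if every single call is fast.
    return 1.0 - (1.0 - p_single) ** fanout


if __name__ == "__main__":
    for n in (1, 10, 50, 100):
        print(f"fanout={n:3d} -> {fanout_timeout_probability(0.01, n):.1%}")
```

With a 1% chance of a slow call and a fanout of 100, about 63% of requests wait on at least one slow call, which matches the example given in "The Tail at Scale".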

Analysis and Modeling

As mentioned above, many downstream business services of the video details page cover only some videos, that is, only some videos have associated data, so a BloomFilter-like mechanism is often used to filter out requests for unassociated videos.

We bucketed the response sizes of the downstream requests made by the video details BFF (using a Prometheus Histogram). The analysis shows that the responses returned by many business services follow the distribution in the figure below:

(Figure 5: Distribution of response sizes returned by services to BFF requests)
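
As an illustration of this kind of analysis, the sketch below counts response sizes into cumulative buckets in the style of Prometheus `le` buckets and computes the share of near-empty responses. The bucket bounds and the 16-byte threshold for "empty" are assumptions made for the example, not values from the article.

```python
from bisect import bisect_left

# Hypothetical upper bounds in bytes; the last bucket is +Inf, as in Prometheus.
BUCKET_BOUNDS = [16, 128, 1024, 8192, float("inf")]


def cumulative_buckets(sizes, bounds=BUCKET_BOUNDS):
    """Count observations with size <= each bound (cumulative buckets)."""
    counts = [0] * len(bounds)
    for size in sizes:
        # Index of the first bucket whose upper bound is >= size.
        first = bisect_left(bounds, size)
        for i in range(first, len(bounds)):
            counts[i] += 1
    return dict(zip(bounds, counts))


def empty_ratio(sizes, empty_bound=16):
    """Share of responses no larger than `empty_bound` bytes."""
    sizes = list(sizes)
    if not sizes:
        return 0.0
    return sum(1 for s in sizes if s <= empty_bound) / len(sizes)
```

A service whose smallest bucket holds almost all observations is a good candidate for index-based filtering, because most of its responses carry no data.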

More than 90% of the responses from the service interfaces called by the BFF are "empty", meaning the requested video has no association with that service. Nevertheless, the video details BFF requests these services every time it fetches video details. The root cause is that the BFF layer does not know which services are associated with the video when it processes the request.

If the BFF layer knows in advance which services are associated with the requested video, we can greatly reduce the number of BFF fanout reads and the load on business services, and achieve on-demand access.

For each video we can create a sparse vector of its associated services, called the video-service index, as shown below:


(Figure 6: Index model of video id and associated business)

In practice, the video-service index does not have to be stored as sparse vectors; an off-the-shelf KV system works as well. We, for example, implement it with Redis hashes. It is also necessary to have a full import (for the initial stage) and an incremental mechanism that notifies the index service whenever the relationship between a business and a video changes.
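
The index model and its two update paths can be sketched with an in-memory stand-in for the KV store. The class and method names below are hypothetical, and a production index would live in Redis or another KV system rather than in a Python dict.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional, Tuple


@dataclass
class VideoServiceIndex:
    """In-memory stand-in for the video-service index (one hash per video)."""

    _data: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def full_load(self, rows: Iterable[Tuple[str, str, str]]) -> None:
        """Initial import: merge rows of (video_id, service, value)."""
        for vid, biz, value in rows:
            self._data.setdefault(vid, {})[biz] = value

    def apply_change(self, vid: str, biz: str, value: Optional[str]) -> None:
        """Incremental update: a value of None removes the association."""
        if value is None:
            services = self._data.get(vid)
            if services is not None:
                services.pop(biz, None)
                if not services:
                    # Keep the index sparse: drop videos with no services.
                    del self._data[vid]
        else:
            self._data.setdefault(vid, {})[biz] = value

    def associated(self, vid: str) -> Dict[str, str]:
        """Services associated with a video; empty for unindexed videos."""
        return dict(self._data.get(vid, {}))
```

Storing only existing associations keeps the structure sparse: a video with no associated business simply has no entry.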

Implementation

Based on the previous problem analysis and modeling, we optimized the architecture of video details BFF as shown in the following figure:


(Figure 7: Optimized architecture and processing flow)

In the BFF request processing flow, step ① introduces the business association index service: before the BFF requests downstream business services, it fetches the index of the businesses associated with the video. In step ②, the BFF uses this index to determine which business services the request needs to access and filters out requests to unrelated ones. The index is implemented with Redis hashes, and the company's internal KV storage is used for persistence and as a fallback when Redis fails. An example of setting a Redis key is as follows:

HMSET index_vid1234 biz1 0 biz2 1 bizM "hot"
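
The two steps of this flow can be sketched as follows. This is a minimal illustration under several assumptions: the index stores are plain callables, a Redis failure surfaces as `ConnectionError`, and the presence of a field in the hash is treated as "associated" (the text does not specify the meaning of the stored values such as 0, 1 or "hot", so the sketch ignores them). All names are hypothetical.

```python
from typing import Callable, Dict, Mapping

# A downstream call takes a video id and returns that service's payload.
ServiceCall = Callable[[str], dict]
IndexReader = Callable[[str], Mapping[str, str]]


def lookup_index(vid: str, primary: IndexReader, fallback: IndexReader) -> Mapping[str, str]:
    """Step 1: read the index, degrading to the fallback store on failure."""
    try:
        return primary(vid)
    except ConnectionError:
        # Assumed failure mode of the primary (Redis) store.
        return fallback(vid)


def aggregate(vid: str, services: Dict[str, ServiceCall],
              index: Mapping[str, str]) -> Dict[str, dict]:
    """Step 2: call only the services the index associates with this video."""
    return {
        name: call(vid)
        for name, call in services.items()
        if name in index
    }
```

Services that are absent from the index are never called, which is where the reduction in fanout and downstream load comes from.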

The index of video-associated businesses is built by importing the association data of downstream businesses in full and then incrementally. To help downstream businesses import heterogeneous data into the index more efficiently, we provide a back-office system in which import functions that clean business change messages can be written online. As shown below:

(Figure 8: Back-office system for business change event processing functions and index update pushes)
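
A cleaning function of this kind might look like the sketch below. The message format (JSON with `vid`, `action` and `value` fields) is invented for the example; only the `index_vid<id>` key shape follows the Redis example in the text.

```python
import json
from typing import Optional, Tuple

# (redis key, hash field, value); a value of None means "remove the field".
IndexUpdate = Tuple[str, str, Optional[str]]


def clean_change_event(biz: str, raw: bytes) -> Optional[IndexUpdate]:
    """Turn one business change message into an index update, or drop it."""
    try:
        msg = json.loads(raw)
    except ValueError:
        # Covers malformed JSON and undecodable bytes.
        return None
    if not isinstance(msg, dict):
        return None
    vid = msg.get("vid")
    if isinstance(vid, bool) or not isinstance(vid, int) or vid <= 0:
        return None
    key = f"index_vid{vid}"
    if msg.get("action") == "delete":
        return (key, biz, None)
    # Hypothetical default: "1" marks a plain association.
    return (key, biz, str(msg.get("value", "1")))
```

Returning `None` for messages that cannot be parsed keeps malformed upstream data out of the index instead of failing the whole import.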

Pattern extension

Further research showed that video details is not the only case: the story (short video), live broadcast, news, "My" page and other detail pages all present similar aggregation scenarios, and, as in Figure 3, these scenarios appear simultaneously in the BFFs of the APP, TV, Web and other terminals. Can aggregation problems like that of video details be solved with a more standard, general-purpose solution?

As shown in Figure 3 above, the main processing logic of BFF is divided into: parameter processing, aggregation logic, and assembly of returned objects (VO). We can abstract complex aggregation logic such as video, live broadcast, and user into a more general aggregation service, which can be used by all BFFs. To do this, a generic aggregation service needs to have the following capabilities:

  1. Support different terminal BFFs in obtaining aggregation models on demand.
  2. Support flexible extension of the aggregation model, that is, keep the cost of adding a new business as low as possible while still satisfying 1.
  3. Support the ability described earlier to reduce load through the business association index.

Regarding point 1, common practices in the industry include the following:

  • GraphQL: Filters the required information through field selectors. Although GraphQL is comprehensive and flexible, introducing it greatly increases the complexity of system implementation and troubleshooting, which hurts long-term maintenance and iteration (see Reference 2 for details).
  • Protobuf field mask: Google APIs proposes adding a field of type google.protobuf.FieldMask to the request parameters to specify which fields should be returned, in order to reduce the network transmission and server-side computing costs caused by unnecessary fields. However, Google APIs has announced that read_mask is deprecated.
  • View Enum: As an alternative to the field mask for on-demand retrieval, Google APIs offers View Enum (see Reference 3 for details). The service provider defines an enum of common on-demand access scenarios, for example BASIC returns basic information for list scenarios, and ALL returns full details for detail page scenarios. Richer enum definitions are also supported, which fits our needs well.

The following is our View Enum definition for the video details page:

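enum ArchiveView {
    // Unspecified; no data is returned
    UNSPECIFIED = 0;
    // View definitions for the most common scenarios follow
    // Returns simple manuscript information (for information lookup)
    SIMPLE = 1;
    // Returns basic manuscript information (for home page and search list queries)
    BASIC = 2;
    // Returns basic manuscript information plus page (multi-part) information
    // (the most minimal detail view, for scenarios such as sharing)
    BASIC_WITH_PAGES = 3;
    // Returns all video detail information for the APP
    ALL_APP = 4;
    // Returns all video detail information for the Web
    ALL_WEB = 5;
    // Returns all video detail information for TV
    ALL_TV = 6;
    // New scenarios can be added over time
}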
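
To illustrate how a view enum drives on-demand retrieval, the sketch below mirrors ArchiveView in Python and maps each view to a set of fields to return. The field names and the per-terminal field groups are hypothetical; the article does not list which fields each view contains.

```python
from enum import IntEnum


class ArchiveView(IntEnum):
    UNSPECIFIED = 0
    SIMPLE = 1
    BASIC = 2
    BASIC_WITH_PAGES = 3
    ALL_APP = 4
    ALL_WEB = 5
    ALL_TV = 6


# Hypothetical field groups, each view extending a smaller one.
_SIMPLE = frozenset({"id", "title"})
_BASIC = _SIMPLE | {"cover", "author", "stat"}
_PAGES = _BASIC | {"pages"}
_DETAIL = _PAGES | {"honors", "topics"}

VIEW_FIELDS = {
    ArchiveView.UNSPECIFIED: frozenset(),
    ArchiveView.SIMPLE: _SIMPLE,
    ArchiveView.BASIC: _BASIC,
    ArchiveView.BASIC_WITH_PAGES: _PAGES,
    ArchiveView.ALL_APP: _DETAIL | {"app_extras"},
    ArchiveView.ALL_WEB: _DETAIL | {"web_extras"},
    ArchiveView.ALL_TV: _DETAIL | {"tv_extras"},
}


def project(archive: dict, view: ArchiveView) -> dict:
    """Keep only the fields that the requested view exposes."""
    wanted = VIEW_FIELDS[view]
    return {key: value for key, value in archive.items() if key in wanted}
```

Unlike a field mask, the caller picks one of a few named scenarios, so the provider knows in advance which combinations it must support and optimize.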

Regarding point 2, we abstract the aggregation logic into a DAG. We chose the DAG model because some business services depend on the output of others; for example, some video attributes depend on the author information contained in the video's basic information (obtained by accessing the video basic information Service). With this model, adding a new business only requires: 1. specifying the other nodes it depends on; 2. writing the node's logic, including calls to the Service and business logic processing; 3. configuring which View Enum values the node is used in.
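
The three things a new business has to provide map naturally onto a node definition. The sketch below is one possible shape, not DAGW's actual API: it uses Python's standard `graphlib` for ordering, runs nodes sequentially (a real gateway would presumably run independent nodes concurrently), and all node and view names are made up.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter
from typing import Callable, Dict, FrozenSet, Iterable, Tuple


@dataclass(frozen=True)
class Node:
    name: str
    deps: Tuple[str, ...]                       # 1. nodes this node depends on
    run: Callable[[Dict[str, object]], object]  # 2. node logic; sees earlier results
    views: FrozenSet[str]                       # 3. views in which the node is used


def execute(nodes: Iterable[Node], view: str) -> Dict[str, object]:
    """Run the nodes configured for `view`, plus whatever they depend on."""
    by_name = {node.name: node for node in nodes}
    # Collect the selected nodes and their transitive dependencies.
    needed = set()
    stack = [node.name for node in by_name.values() if view in node.views]
    while stack:
        name = stack.pop()
        if name not in needed:
            needed.add(name)
            stack.extend(by_name[name].deps)
    # Map each node to its predecessors; static_order yields dependencies first
    # and raises graphlib.CycleError if the graph is not a DAG.
    graph = {name: set(by_name[name].deps) for name in needed}
    results: Dict[str, object] = {}
    for name in TopologicalSorter(graph).static_order():
        results[name] = by_name[name].run(results)
    return results
```

A node that is not configured for the requested view is skipped unless a selected node depends on it, so smaller views trigger fewer downstream calls.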

Regarding point 3, the principle was introduced earlier; we only need to extend the index from the video-service index to live broadcast-service and user-service indexes.

To sum up, we named the general data aggregation service DAGW (Data Aggregation Gateway). The internal structure of DAGW and its interaction with the BFF layer and Services are shown in the following figure:


(Figure 9: Introduce the general data aggregation gateway layer DAGW to meet the requirements of aggregation scenarios)

Results

After the DAGW general data aggregation gateway and the business association index went live, they supported aggregation of video, user and other information. Nearly 30 business services have been connected, and their traffic and load have dropped by more than 90% on average. Below are the results for two connected services, video high-energy highlights and user fan medals:

1. For the video high-energy highlights service, traffic from the playback page (app-view) reaches 100k+ QPS at peak. After connecting to DAGW the effect is significant: the monitoring in the figure below shows that request QPS dropped by 99%.

(Figure: Request QPS monitoring of the high-energy highlights service)

2. The fan medal is a wearable honor for hardcore fans, which users earn by watching an anchor's live broadcasts over a long period and interacting with them. Because the threshold for earning it is high and it is only displayed under the content of the specific anchor, connecting to DAGW reduces its access traffic by more than 85%.


References

1. The Tail at Scale: https://research.google/pubs/pub40801/

2. GraphQL: From Excitement to Deception: https://betterprogramming.pub/graphql-from-excitement-to-deception-f81f7c95b7cf

3. View Enum: https://google.aip.dev/157