The Savior of Architecture Upgrade! A Must-Have Guide for Traffic Playback Automated Testing

2024.10.16

Hello, everyone. I’m Xiaomi, a 29-year-old technology geek. Today I want to share with you a very practical skill in the field of Internet of Things - traffic playback automated testing.

In day-to-day development and operations, system upgrades and architecture changes are inevitable, especially system-level refactoring. For the development team, the regression testing workload after each major refactoring is huge and often takes months. If you fall into this pit, think about the enormous test suites and complex scenarios involved. Just thinking about it gives you a headache, right?!

Challenges after system reconstruction

We often encounter the following scenarios:

Scenario 1: The read service is basically a stateless query, the state does not change, it is simple and lightweight, and data can be easily returned.
Scenario 2: Regardless of architecture upgrades or daily functional requirements, the external interface format of the read service generally does not change, that is, the input and output formats remain unchanged.

This seemingly simple scenario is one of the difficulties in system reconstruction. You will find that although there is no problem with the data itself, the reconstructed logic often contains invisible bugs, and these problems are usually exposed in the production environment. This raises a question - how to perform comprehensive automated testing without affecting online services?

Two common but hard-to-implement solutions

When faced with refactoring, many companies come up with two common solutions:

Don’t change anything for now, and look for a solution only when the system can no longer cope: This is a “put it off” strategy. Problems accumulate until they blow up, which can easily bring the system down, so it is not worth the cost.
Suspend new requirements and focus on the transformation: This approach sounds ideal, but in practice business needs never stop. Suspending requirements means lost market opportunities, delayed delivery, and other problems.

The actual situation is:

Both strategies are difficult to implement. We cannot let the system crash, and we cannot completely stop business requirements either. This forces us to find a better way to deal with the problem.

Strategies for log collection and traffic playback

Here, Xiaomi would like to introduce a very practical solution: automated testing based on traffic playback. The core idea of this method is log collection and data playback.

Log Collection

First, we need to collect logs of real user requests. Logs do more than store information; more importantly, each logged request can serve as a test case for regression testing.

How do we collect the logs? In the Spring framework we can use an Interceptor, and in a Servlet application we can use a Filter. For each request, we record the input and output parameters and send them to storage through a message queue (MQ).

There are a few issues to note here:

Peak shaving: Buffer the writes so that a large influx of data in a short period does not cause performance problems.
Data filtering and deduplication: Keep redundant data from occupying storage space and make sure the recorded data is valid.
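
To make the collection step concrete, here is a minimal Python sketch of what the recording side might look like. It is not Spring or Servlet code: the names `record`, `fingerprint`, and `log_queue` are hypothetical, and a bounded in-memory queue stands in for the real MQ producer. Sampling and a non-blocking enqueue are one possible way to handle peaks, and a request fingerprint is one possible way to deduplicate.

```python
import hashlib
import json
import queue
import random

# Assumption: a bounded in-memory queue stands in for the MQ producer.
log_queue = queue.Queue(maxsize=1000)
_seen = set()  # fingerprints of requests that were already recorded


def fingerprint(path, params):
    """Stable hash of the request, used for deduplication."""
    raw = json.dumps({"path": path, "params": params}, sort_keys=True)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def record(path, params, response, sample_rate=1.0):
    """Record one request/response pair; return True if it was enqueued."""
    if random.random() >= sample_rate:  # sampling keeps peak volume down
        return False
    key = fingerprint(path, params)
    if key in _seen:                    # skip requests we already have
        return False
    entry = {"id": key, "path": path, "params": params, "response": response}
    try:
        log_queue.put_nowait(entry)     # never block the serving thread
    except queue.Full:
        return False                    # drop under pressure rather than slow requests
    _seen.add(key)
    return True
```

In an Interceptor or Filter, a call like `record(path, params, response)` would run after the response has been produced, so that logging stays off the critical path as much as possible.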

Data playback

The next step is the key one: data playback. Using the historical request data collected in the logs, we can play the requests back against the reconstructed system, simulate real user requests, and perform automated testing.

Data playback can be divided into three modes:

Offline playback: Only the new service is called, and its return result is compared with the original output parameters recorded in the log. It does not directly affect the online system, but because of the large volume of logs, the storage requirements are high.
Real-time playback: Each request is sent to both the online system and the new system, and their return results are compared in real time. The disadvantage is a certain performance impact on the online system, so it is best used when system pressure is relatively low.
Parallel playback: The new version is not launched directly. Instead, the new version's interface is played back in parallel with the old version's interface with a certain probability. The cycle is longer, but the impact is smaller. It is suitable for the period before the new system is stably launched.
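
The offline and parallel modes can be sketched roughly as follows. This is only an illustration under assumptions: `offline_replay` and `parallel_call` are hypothetical names, the services are modeled as plain functions taking `(path, params)`, and log entries are assumed to be dicts with `path`, `params`, and `response` keys.

```python
import random


def offline_replay(entries, new_service):
    """Call only the new service and compare with the logged output."""
    mismatches = []
    for entry in entries:
        actual = new_service(entry["path"], entry["params"])
        if actual != entry["response"]:
            mismatches.append({"entry": entry, "actual": actual})
    return mismatches


def parallel_call(path, params, old_service, new_service, ratio, on_diff,
                  rng=random.random):
    """Serve from the old service; shadow a fraction of requests to the new one."""
    result = old_service(path, params)
    if rng() < ratio:  # only a share of the traffic reaches the new version
        try:
            shadow = new_service(path, params)
            if shadow != result:
                on_diff(path, params, result, shadow)
        except Exception as exc:  # a failing new version must not affect users
            on_diff(path, params, result, repr(exc))
    return result  # callers always get the old system's answer
```

Real-time playback corresponds to `parallel_call` with `ratio=1.0`. In a real deployment the shadow call would more likely run asynchronously, so that it does not add latency to the user's request.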

Difference comparison and bug location

What is the result of data playback? Ultimately, what we care about is whether we can find the bugs after reconstruction! By comparing the differences, we can automatically find those use cases that are inconsistent with expectations.

In this process, we can find problems quickly by comparing text. Since the input and output formats of the interface remain unchanged, we only need to check whether the specific data returned differs. For example, some fields may have different values in the new and old systems, and these may be potential bugs. We can mark these differences and hand them over to developers for further investigation and repair.
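
A field-level comparison of this kind could look like the sketch below. It assumes the responses are JSON-like structures (dicts, lists, and scalars); the function name `diff_fields` and the `ignore` parameter are hypothetical. The `ignore` set is an added assumption for fields that legitimately differ between runs, such as timestamps.

```python
def diff_fields(expected, actual, path="", ignore=frozenset()):
    """Return (field_path, expected_value, actual_value) for every difference."""
    if path in ignore:
        return []
    if isinstance(expected, dict) and isinstance(actual, dict):
        diffs = []
        # Note: a missing key and a key set to None are treated the same here.
        for key in sorted(set(expected) | set(actual)):
            child = f"{path}.{key}" if path else key
            diffs += diff_fields(expected.get(key), actual.get(key), child, ignore)
        return diffs
    if (isinstance(expected, list) and isinstance(actual, list)
            and len(expected) == len(actual)):
        diffs = []
        for i, (e, a) in enumerate(zip(expected, actual)):
            diffs += diff_fields(e, a, f"{path}[{i}]", ignore)
        return diffs
    return [] if expected == actual else [(path, expected, actual)]


old = {"id": 1, "price": 9.9, "tags": ["a", "b"]}
new = {"id": 1, "price": 10.0, "tags": ["a", "c"]}
print(diff_fields(old, new))
# [('price', 9.9, 10.0), ('tags[1]', 'b', 'c')]
```

Reporting the field path rather than the whole response makes it easier for developers to see which part of the refactored logic produced the difference.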

Tips in practice

In the actual process of implementing traffic playback, Xiaomi has also accumulated some tips to help everyone use this tool better.

Log compression and storage optimization: The amount of log data generated by traffic replay is very large, so it is necessary to consider log compression and storage optimization strategies. We can compress the collected logs, or regularly clean up old logs that are no longer needed to save storage space.
Integration of automated tool chains: Integrating traffic replay with existing automated tool chains can greatly improve testing efficiency. For example, combined with CI/CD tools such as Jenkins, playback tests are automatically triggered, differences are automatically recorded, and reports are generated.
Combining grayscale release with traffic replay: During grayscale release, traffic replay can be used to perform parallel testing of new and old systems to detect potential problems in advance and ensure the stability of the new version.
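
As an illustration of the compression and cleanup tip, here is a small sketch that stores log entries as gzip-compressed JSON Lines files and deletes files past a retention period. The `.jsonl.gz` naming, the function names, and the use of file modification time for retention are all assumptions made for this example.

```python
import gzip
import json
import os
import time


def write_compressed(entries, file_path):
    """Write entries as gzip-compressed JSON Lines."""
    with gzip.open(file_path, "wt", encoding="utf-8") as fh:
        for entry in entries:
            fh.write(json.dumps(entry, sort_keys=True) + "\n")


def read_compressed(file_path):
    """Yield entries back from a gzip-compressed JSON Lines file."""
    with gzip.open(file_path, "rt", encoding="utf-8") as fh:
        for line in fh:
            yield json.loads(line)


def purge_old(directory, max_age_seconds, now=None):
    """Delete *.jsonl.gz files older than max_age_seconds; return their names."""
    now = time.time() if now is None else now
    removed = []
    for name in sorted(os.listdir(directory)):
        full = os.path.join(directory, name)
        too_old = now - os.path.getmtime(full) > max_age_seconds
        if name.endswith(".jsonl.gz") and too_old:
            os.remove(full)
            removed.append(name)
    return removed
```

A scheduled job, for example a Jenkins job, could call `purge_old` once a day and feed `read_compressed` into the playback step.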

Advantages of Traffic Replay

In general, traffic replay provides an automated, efficient, and low-interference solution for regression testing after system reconstruction. Compared with traditional manual testing, traffic replay has several obvious advantages:

Real user requests: Based on real user request logs, ensure comprehensive coverage of test scenarios.
Automated regression testing: reduces the workload of manual testing and improves efficiency.
Quickly discover problems: By comparing differences, you can find and locate bugs in a timely manner.
Low risk: The new version is not directly released online, thus avoiding interference with online business.

END

Traffic replay automated testing provides us with a solution that can both meet business needs and ensure system stability when dealing with system reconstruction. Through log collection, data playback and difference comparison, the development team can quickly locate problems, reduce the workload of regression testing, and greatly improve the efficiency of system upgrades.
