Architect Interview: How to Plan the Company's Monitoring Architecture?

Hello everyone, I am Brother Jun.

A monitoring system is essential in a technology company. It lets operations and R&D staff detect and locate problems early and then resolve them.

In practice, the monitoring tools we use are often varied and disorganized. Today, let's talk about how to plan the company's monitoring architecture.

1. Indicator collection

Monitoring indicator collection provides raw data for monitoring and is the basis of the monitoring system.

1.1 System indicators

When we use Prometheus for monitoring, we can use Node Exporter to collect system indicators such as memory, CPU, disk, and file descriptor usage.
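As a rough illustration of what this raw data looks like, the sketch below parses a few lines in the Prometheus text exposition format and derives usage ratios from them. The metric names follow Node Exporter's naming, but the sample values are made up, and the helper names (parse_samples, memory_used_ratio, fd_used_ratio) are hypothetical. The parser is deliberately simplified: it assumes no timestamps and skips labeled series.

```python
# Made-up sample in Prometheus text exposition format.
SAMPLE = """\
# TYPE node_memory_MemTotal_bytes gauge
node_memory_MemTotal_bytes 8.0e+09
node_memory_MemAvailable_bytes 2.0e+09
node_filefd_allocated 1024
node_filefd_maximum 65536
"""


def parse_samples(text):
    """Parse unlabeled 'name value' lines; comments and labeled series are skipped."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        if "{" in name:
            continue  # labeled series are out of scope for this sketch
        samples[name] = float(value)
    return samples


def memory_used_ratio(samples):
    """Fraction of memory in use, based on MemAvailable."""
    total = samples["node_memory_MemTotal_bytes"]
    available = samples["node_memory_MemAvailable_bytes"]
    return 1 - available / total


def fd_used_ratio(samples):
    """Fraction of the file descriptor limit currently allocated."""
    return samples["node_filefd_allocated"] / samples["node_filefd_maximum"]


if __name__ == "__main__":
    parsed = parse_samples(SAMPLE)
    print(memory_used_ratio(parsed), fd_used_ratio(parsed))
```

In a real deployment Prometheus scrapes and stores these values itself; the sketch only shows the shape of the data being collected.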

1.2 Database/Middleware

Database and middleware failures can have a significant impact on business, and in extreme cases may even cause business to come to a halt. Therefore, supporting database and middleware monitoring is a must.

For both databases and middleware, it is necessary to collect system information such as memory, CPU, and disk usage of the machines that host them.

For databases, it is also necessary to collect SQL execution time, database logs, and similar data as monitoring indicators.

For middleware, you can collect throughput, average response time, and indicators specific to the middleware itself, such as Kafka's ISR shrink and expand rates (IsrShrinksPerSec/IsrExpandsPerSec).

For components that run on the JVM, also collect JVM-related indicators, such as heap memory, full GC frequency and duration, and thread usage.

1.3 Business Indicators

Business systems can have a large number of complex monitoring indicators, reflecting the complexity of the business itself.

Interface-level indicators, such as the number of requests, average response time, and success rate, can be obtained by capturing network packets.
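To make these three indicators concrete, here is a minimal sketch that computes them from a list of request records. It assumes the records have already been extracted by some capture step, which is not shown; RequestRecord and summarize are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    interface: str      # e.g. "/pay"
    duration_ms: float  # time taken to respond
    success: bool       # whether the call succeeded


def summarize(records):
    """Return request count, average response time, and success rate per interface."""
    stats = {}
    for r in records:
        s = stats.setdefault(r.interface, {"count": 0, "total_ms": 0.0, "ok": 0})
        s["count"] += 1
        s["total_ms"] += r.duration_ms
        if r.success:
            s["ok"] += 1
    return {
        name: {
            "count": s["count"],
            "avg_ms": s["total_ms"] / s["count"],
            "success_rate": s["ok"] / s["count"],
        }
        for name, s in stats.items()
    }


if __name__ == "__main__":
    sample = [
        RequestRecord("/pay", 100.0, True),
        RequestRecord("/pay", 300.0, False),
        RequestRecord("/pay", 200.0, True),
    ]
    print(summarize(sample))
```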

There are two ways to obtain indicators for the overall operating status of the business. One is to collect business logs. The other is to have the business code push them actively, for example by abstracting the operating status into indicators and saving them in a database, or by sending them to the collection system through a message queue.
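The active-push approach can be sketched as follows. This is only an illustration: an in-process queue.Queue stands in for a real message queue, and report_metric, drain, and the event fields are hypothetical names, not part of any particular collection system.

```python
import json
import queue
import time

# Stand-in for a real message queue; assumed for illustration only.
metrics_queue = queue.Queue()


def report_metric(name, value, tags=None, q=metrics_queue):
    """Called from business code: serialize one indicator and push it to the queue."""
    event = {"name": name, "value": value, "tags": tags or {}, "ts": time.time()}
    q.put(json.dumps(event))


def drain(q=metrics_queue):
    """Collector side: read every event currently waiting in the queue."""
    events = []
    while True:
        try:
            events.append(json.loads(q.get_nowait()))
        except queue.Empty:
            return events


if __name__ == "__main__":
    report_metric("order_paid", 1, {"channel": "app"})
    report_metric("order_failed", 1, {"channel": "web"})
    for event in drain():
        print(event["name"], event["tags"])
```

Pushing through a queue keeps the business code decoupled from the collection system: the business side only needs to know the event format, not where the data ends up.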

2. Indicator storage

After collecting monitoring indicators, you need to store them before you can show them to operations staff. Prometheus stores monitoring data in its built-in time series database (TSDB). Some companies choose to introduce an external time series database, such as VictoriaMetrics, which many companies have adopted.

For indicators with a relatively small data volume, a relational database can be used for storage, which lowers the learning cost for R&D and operations staff.
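For a low-volume indicator, relational storage can be as simple as one table keyed by name and timestamp. The sketch below uses Python's built-in sqlite3 module as a stand-in for whatever relational database the company already runs; the table name metric_sample and the helper functions are assumptions for illustration.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the database and create the sample table if it does not exist."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS metric_sample (
               name  TEXT    NOT NULL,
               ts    INTEGER NOT NULL,  -- Unix timestamp in seconds
               value REAL    NOT NULL,
               PRIMARY KEY (name, ts)
           )"""
    )
    return conn


def save_sample(conn, name, ts, value):
    """Insert one sample; a repeated (name, ts) pair overwrites the old value."""
    conn.execute(
        "INSERT OR REPLACE INTO metric_sample (name, ts, value) VALUES (?, ?, ?)",
        (name, ts, value),
    )
    conn.commit()


def query_range(conn, name, start_ts, end_ts):
    """Return (ts, value) rows for one indicator within an inclusive time range."""
    cursor = conn.execute(
        "SELECT ts, value FROM metric_sample "
        "WHERE name = ? AND ts BETWEEN ? AND ? ORDER BY ts",
        (name, start_ts, end_ts),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    store = open_store()
    save_sample(store, "batch_job_duration_s", 10, 1.0)
    save_sample(store, "batch_job_duration_s", 20, 2.0)
    print(query_range(store, "batch_job_duration_s", 0, 20))
```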

Small companies can generally meet their indicator storage needs with open source tools. In large companies, the business types are wide-ranging and the volume of indicator data is large, so it is necessary to plan the storage strategy and combine multiple storage methods.

3. Indicator processing

When monitoring only a few specific indicators, the monitoring goal is achieved as long as the data is collected and displayed correctly. However, this only works for simple indicators, such as the number of requests to an interface.

From a macro perspective, the business side cares more about overall data, such as the number of successful and failed transactions yesterday. They only have time to look at the overall dashboard. This is where data processing and aggregation are needed.
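A minimal sketch of this kind of aggregation is shown below: it rolls individual transaction events up into per-day success and failure counts. The event shape (a Unix timestamp plus a success flag), the use of UTC days, and the function name daily_transaction_counts are all assumptions made for the example.

```python
from collections import Counter
from datetime import datetime, timezone


def daily_transaction_counts(events):
    """Aggregate (unix_ts, succeeded) events into per-day success/failure counts.

    Days are computed in UTC, which is a simplification for this sketch.
    """
    per_day = {}
    for ts, succeeded in events:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()
        counter = per_day.setdefault(day, Counter())
        counter["success" if succeeded else "failure"] += 1
    return {
        day: {"success": c["success"], "failure": c["failure"]}
        for day, c in per_day.items()
    }


if __name__ == "__main__":
    sample = [(0, True), (3600, False), (86400, True)]
    for day, counts in sorted(daily_transaction_counts(sample).items()):
        print(day, counts)
```

At larger data volumes the same roll-up would more likely run as a scheduled job in the data platform, but the logic is the same.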

Therefore, indicator processing is also very important, and this work can also be handed over to the company's big data team.

4. Indicator display

After indicator collection and processing are complete, presenting the results to users clearly and elegantly is an important part of monitoring design.

Well-known monitoring tools such as Prometheus and Zabbix have mature visual interfaces that can display information clearly. However, these tools struggle to meet the needs of more complex and demanding monitoring scenarios.

In that case, the technical team needs to develop its own indicator visualization tools, which include not only the overall dashboard but also monitoring pages tailored to different roles, such as business, operations, and R&D.

5. Monitoring and alarm

With the collection, processing, and display of indicators in place, we have only completed the preparation work for the monitoring system. Monitoring alarms are an important goal of our monitoring system planning.

The goal of monitoring alarms is to let the relevant personnel become aware of problems early and take timely measures to keep them from spreading.

To save system resources, monitoring alarms also need to be classified into levels, planned according to the severity of the problem and the scope of business impact. This requires the business side to clarify the importance of each business function during the requirements stage and to help determine the monitoring level.

For example, transaction, payment, and accounting systems that involve money are very important to the company, so they can be assigned the serious level. When a problem is detected, the on-duty operations staff are notified via SMS and OA messages. They can then promptly notify the relevant R&D staff depending on the problem, so measures can be taken quickly even if the problem occurs in the middle of the night.

Business systems and batch transactions with less impact can be assigned the major level. When a problem occurs, SMS or OA can be used to notify the system manager in real time.

Transactions with no business impact can be assigned the secondary level. When a problem occurs, developers only need to be notified by email, and real-time notification is unnecessary. Once or a few times a day is enough.
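The three levels above can be expressed as a small routing table. The following is only a sketch: the channel and recipient identifiers are hypothetical labels, and the actual sending of SMS, OA messages, or email is left out.

```python
from enum import Enum


class Level(Enum):
    SERIOUS = "serious"
    MAJOR = "major"
    SECONDARY = "secondary"


# Routing rules taken from the levels described in the text.
ROUTES = {
    # Serious: SMS and OA message to the on-duty operations staff, in real time.
    Level.SERIOUS: {"channels": ("sms", "oa"), "recipient": "on_duty_ops", "realtime": True},
    # Major: the text allows SMS or OA; this sketch arbitrarily picks SMS.
    Level.MAJOR: {"channels": ("sms",), "recipient": "system_manager", "realtime": True},
    # Secondary: email to developers, batched into a periodic digest.
    Level.SECONDARY: {"channels": ("email",), "recipient": "developers", "realtime": False},
}


def plan_notifications(alerts):
    """Split (level, message) alerts into immediate notifications and a digest."""
    immediate, digest = [], []
    for level, message in alerts:
        route = ROUTES[level]
        target = immediate if route["realtime"] else digest
        for channel in route["channels"]:
            target.append((channel, route["recipient"], message))
    return immediate, digest


if __name__ == "__main__":
    now, later = plan_notifications([
        (Level.SERIOUS, "payment failures rising"),
        (Level.SECONDARY, "report job slow"),
    ])
    print("send now:", now)
    print("send in digest:", later)
```

Keeping the rules in a table makes it straightforward to review the levels with the business side and adjust them without touching the dispatch logic.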

6. Emergency Plan

Emergency plans need to be designed and rehearsed in advance so that the team is prepared when an incident happens. If a serious alarm is triggered at 2 a.m. and the R&D staff are woken from their sleep, they will panic if there is no emergency plan.

Emergency plans can be designed based on the actual situation of the system, and can include measures such as restarting services, rate limiting interfaces, circuit breaking, scaling out the cluster, and removing faulty nodes.
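As one example of such a measure, here is a minimal sketch of a circuit breaker. It is a simplified illustration rather than a production design: the thresholds are arbitrary, the class and method names are hypothetical, and the clock is injectable only so the behavior can be rehearsed without waiting.

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency, then allow a trial call after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_after = reset_after              # cool-down in seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def allow(self):
        """Return True if a call may go through right now."""
        if self.opened_at is None:
            return True
        # Once the cool-down has elapsed, let trial calls through (half-open).
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        """A successful call closes the circuit and resets the failure count."""
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        """Count a failure; open (or re-open) the circuit at the threshold."""
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


if __name__ == "__main__":
    breaker = CircuitBreaker(failure_threshold=2, reset_after=5.0)
    breaker.record_failure()
    breaker.record_failure()
    print("calls allowed:", breaker.allow())
```

The caller checks allow() before each request to the dependency and reports the outcome with record_success() or record_failure().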

7. Summary

A monitoring system is very important for technology companies. We can design it in terms of indicator collection, indicator storage, indicator processing, indicator display, monitoring alarms, and emergency plans. I hope this article helps you design your monitoring architecture.