Principles to follow when monitoring a microservice architecture

As software delivery practices change, the rise of microservice architecture has made development faster and more flexible, and the monitoring system has become a core component of keeping microservices under control. As software complexity continues to increase, it becomes harder to understand the performance of a system and to troubleshoot problems in a timely manner. The monitoring system therefore needs to keep pace and undergo a thorough transformation, so that it fits the needs of a microservice environment and can do its job well.

Microservice architecture has brought speed and flexibility to software development and, in doing so, changed the traditional development model. Speed in particular is one of the core requirements of microservices, and to meet that demand the monitoring system needs to be adjusted and improved accordingly.

Monitoring becomes especially critical in microservices architectures because as software complexity increases, we need to better understand the performance of the system and detect problems in a timely manner. Here are five guiding principles we developed to help shape our monitoring approach:

  1. Monitor containers and their internals: Microservices often run in containers, so monitoring the containers themselves as well as the various components and resources within them is crucial.
  2. Focus on service performance rather than container performance: In a microservices architecture, the focus should be on monitoring the performance and health of the service rather than simply monitoring the resource usage of the container.
  3. Monitor elastic and multi-location services: One of the advantages of microservice architecture is its elasticity and ability to be deployed in multiple locations, so the monitoring system needs to be able to effectively track and manage these characteristics.
  4. Monitor APIs: Microservices often provide services in the form of APIs, so monitoring the performance and availability of these APIs is critical to ensuring the stable operation of the entire system.
  5. Map monitoring to the organizational structure: Design the structure of the monitoring system around the organizational structure and team responsibilities, so that monitoring data intuitively reflects the operating status of the entire organization.

By following these five principles, a more effective and comprehensive microservice monitoring system can be established to better cope with the technical and organizational challenges posed by the microservices architecture.

Below we elaborate on each of these five principles.

Monitor containers and their internals

With the popularity of microservice architecture, container technology has become an important part of building microservices. Containers offer advantages such as speed, portability, and isolation, making it easier for developers to embrace the microservices model. The benefits of containers have been discussed extensively and I won’t go into details here.

However, to external systems, a container acts like a black box. This is very convenient for developers, since containers can be deployed easily in different environments. But once the container is running, that black box becomes a challenge for monitoring and troubleshooting service issues, because conventional monitoring methods often fail to work. We need a deep understanding of what is running inside the container: What programs and code are running? How are they performing? Are they emitting any important metrics? From a DevOps perspective, we need to understand containers more deeply than just knowing they exist.

In non-container environments, a common monitoring method is to run an agent in the user space of the host or virtual machine. However, this approach is not suitable for container environments. One of the advantages of containers is that they are lightweight: they isolate individual processes and keep dependencies to a minimum, and installing an agent in every container would add exactly the kind of dependency and overhead containers are meant to avoid.

Additionally, running thousands of monitoring agents is an expensive use of resources and a management headache for even a moderately sized deployment. For container environments, there are two potential solutions:

  1. Require developers to instrument their code directly: Make developers responsible for monitoring the performance and health of their own code. This gives developers a deeper understanding of how their code behaves, but it increases their workload and may not yield comprehensive coverage of the entire system (a brief sketch of this approach appears below).
  2. Leverage kernel-level instrumentation: Use a common kernel-level monitoring approach to view all application and container activity on the host. This approach reduces the number of agents and provides more comprehensive monitoring coverage, but may impact the performance of the host and may not provide detailed information related to each container.

Each method has its advantages and disadvantages, and choosing the method best suited to the environment depends on our specific needs and constraints.
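
To make the first option more concrete, below is a minimal sketch of a service instrumenting its own code and exposing metrics over HTTP, so no per-container agent is needed. It assumes the Python prometheus_client library; the metric names, port, and simulated workload are illustrative choices, not something prescribed by this article.

```python
# Sketch of option 1: the service instruments its own code and exposes the
# resulting metrics over HTTP, so no separate monitoring agent runs in the
# container. Assumes the `prometheus_client` library; names are illustrative.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("orders_requests_total",
                   "Requests handled, by outcome", ["outcome"])
LATENCY = Histogram("orders_request_seconds", "Request handling latency")

def handle_request():
    """Pretend to handle one request and record what happened."""
    start = time.perf_counter()
    try:
        time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
        REQUESTS.labels(outcome="ok").inc()
    except Exception:
        REQUESTS.labels(outcome="error").inc()
        raise
    finally:
        LATENCY.observe(time.perf_counter() - start)

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for a scraper to collect
    while True:
        handle_request()
```

The trade-off noted above is visible here: the developer gets precise, application-aware metrics, but every service team has to write and maintain code like this themselves.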

Leverage orchestration systems to alert on service performance

Understanding operational data in a container environment is not easy, and a function or service composed of many containers is even more complex to reason about than a single container.

Aggregating container data into service-level views is particularly useful for application-level information, such as which requests have the slowest response times or which URLs encounter the most errors. It is also suitable for architecture-level monitoring, such as which service's containers are using more CPU than they were allocated.

Increasingly, software deployments rely on orchestration systems to translate the logical plan of an application into the deployment of physical containers. Common orchestration systems include Kubernetes, Mesosphere DC/OS, and Docker Swarm. Teams can leverage the orchestration system to define microservices and understand the current state of each service. It could be argued that the orchestration system is even more important than the containers themselves, since containers are ephemeral and exist only as long as the service needs them.

DevOps teams should focus their alerts on the operating characteristics of the service, so that alerting stays as close as possible to the actual experience of using the service. If an application is affected, these alerts are the first line of defense in assessing the situation. But getting these alerts is not easy unless the monitoring system itself is container-native.

Container-native monitoring solutions leverage orchestration metadata to dynamically aggregate container and application data and calculate monitoring metrics on a per-service basis. Depending on the orchestration tool used, you may want to drill into different levels of the hierarchy. For example, in Kubernetes there are typically Namespaces, ReplicaSets, Pods, and the individual containers inside them. Aggregating across these layers is necessary to troubleshoot logical problems, independent of the physical deployment of the containers that make up the service.
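
As an illustration of using orchestration metadata for per-service aggregation, here is a minimal sketch that groups Kubernetes pods by a label and sums the CPU they were allocated, so the totals can be compared against measured usage. It assumes the official kubernetes Python client and that services are identified by an "app" label; both are assumptions made for this example.

```python
# Sketch: roll per-container data up to the service level using orchestration
# metadata (Kubernetes labels). Assumes the official `kubernetes` Python client
# and an "app" label identifying each service -- illustrative assumptions only.
from collections import defaultdict

from kubernetes import client, config

def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity ("250m" or "1") into cores."""
    return float(quantity[:-1]) / 1000 if quantity.endswith("m") else float(quantity)

def cpu_requests_per_service(namespace: str = "default") -> dict:
    """Sum the CPU requested by every pod, grouped by its "app" label."""
    config.load_kube_config()  # use load_incluster_config() when run in-cluster
    pods = client.CoreV1Api().list_namespaced_pod(namespace)
    totals = defaultdict(float)
    for pod in pods.items:
        service = (pod.metadata.labels or {}).get("app", "unlabelled")
        for container in pod.spec.containers:
            requests = container.resources.requests or {}
            totals[service] += parse_cpu(requests.get("cpu", "0"))
    return dict(totals)

if __name__ == "__main__":
    # Compare these per-service allocations against actual usage from your
    # metrics pipeline to spot services exceeding what they were allocated.
    for service, cores in cpu_requests_per_service().items():
        print(f"{service}: {cores:.2f} CPU cores requested")
```

The same label-based grouping can be applied one level up (Namespace or ReplicaSet) or one level down (individual containers), which is exactly the layered aggregation described above.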

Monitor elastic and multi-location services

Elastic services are not a new concept, but the pace of change in container-native environments is much faster than in virtualized environments, and such rapid change can easily disrupt the normal operation of a monitoring system.

Monitoring traditional systems often requires manual tuning whenever software is deployed: defining which metrics to capture, or configuring what data to collect based on how the application behaves in a particular container. In a small environment (say, dozens of containers), this manual adjustment may be tolerable, but at large scale it becomes unmanageable. Centralized monitoring of microservices must adjust automatically as elastic services grow and shrink, without manual intervention.

For example, if DevOps teams have to manually define which services need to be monitored, they are likely to miss some, because platforms like Kubernetes or Mesos create and destroy containers continuously. Likewise, if the operations team has to manually install a custom stats endpoint every time code is pushed to production, that creates friction each time developers pull a new base image from the Docker registry.
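
One way to avoid that manual bookkeeping is to let the monitoring system learn about containers from the orchestrator itself. Below is a minimal sketch, assuming the official kubernetes Python client, that watches pod events and keeps a set of monitoring targets up to date; the in-memory dictionary stands in for a real scrape-target registry.

```python
# Sketch: keep monitoring targets in sync automatically by watching the
# orchestrator instead of defining them by hand. Assumes the official
# `kubernetes` Python client; the in-memory dict stands in for a real
# scrape-target registry.
from kubernetes import client, config, watch

def watch_monitored_pods() -> None:
    """Track running pods as monitoring targets as they come and go."""
    config.load_kube_config()  # use load_incluster_config() when run in-cluster
    v1 = client.CoreV1Api()
    targets = {}  # pod name -> IP address

    for event in watch.Watch().stream(v1.list_pod_for_all_namespaces):
        pod = event["object"]
        name = f"{pod.metadata.namespace}/{pod.metadata.name}"
        if event["type"] in ("ADDED", "MODIFIED") and pod.status.pod_ip:
            targets[name] = pod.status.pod_ip
        elif event["type"] == "DELETED":
            targets.pop(name, None)
        print(f"{event['type']:>8} {name} -> {len(targets)} targets")

if __name__ == "__main__":
    watch_monitored_pods()
```

In practice most container-native monitoring tools do this discovery for you; the point is that targets come from orchestration metadata, not from a hand-maintained list.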

In a production environment, setting up monitoring for complex deployments that span multiple data centers or multiple clouds is a challenge. For example, if a service spans a private data center and AWS, a provider-specific tool such as Amazon CloudWatch will struggle to cover all of it. We therefore need to build a monitoring system that can span regions and clouds and still run in a dynamic, container-native environment.

Monitor APIs

In a microservices environment, the API is the common language between teams: it is essentially the only part of a service exposed to other teams. In fact, the responsiveness and consistency of an API can be treated as an "internal SLA", even if a formal SLA (Service Level Agreement) has not been defined.

Therefore, monitoring the API is also essential. API monitoring can take different forms, but it clearly needs to go beyond simple binary up/down checks. For example, it is valuable to know the most frequently used endpoints as a function of time, which lets teams see how usage of the service changes, whether because of a design change or a change in its users.

You can also track the slowest-responding endpoints, which may reveal significant problems, or at least point to areas of the system that need optimization.
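
To show what going beyond an up/down check might look like, here is a minimal sketch of per-endpoint API monitoring that records request counts, status codes, and latency. It assumes a Flask service and the prometheus_client library; the route, metric names, and ports are illustrative, not taken from this article.

```python
# Sketch: per-endpoint API monitoring -- request counts, status codes, and
# latency -- rather than a binary up/down check. Assumes Flask and
# `prometheus_client`; all names and ports are illustrative.
import time

from flask import Flask, g, request
from prometheus_client import Counter, Histogram, start_http_server

app = Flask(__name__)

API_REQUESTS = Counter("api_requests_total", "API requests",
                       ["endpoint", "method", "status"])
API_LATENCY = Histogram("api_request_seconds", "API request latency",
                        ["endpoint"])

@app.before_request
def start_timer():
    g.start = time.perf_counter()

@app.after_request
def record_metrics(response):
    endpoint = request.endpoint or "unknown"
    API_REQUESTS.labels(endpoint, request.method, str(response.status_code)).inc()
    API_LATENCY.labels(endpoint).observe(time.perf_counter() - g.start)
    return response

@app.route("/orders")
def list_orders():
    return {"orders": []}  # placeholder handler

if __name__ == "__main__":
    start_http_server(9100)  # metrics served separately on :9100/metrics
    app.run(port=8080)
```

From data like this you can chart the most frequently used endpoints over time and rank the slowest ones, which is exactly the kind of signal described above.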

Finally, the ability to trace service responses across the system is another very important capability. It is primarily aimed at developers, but it also helps you understand the overall user experience. It is likewise helpful to split this information into an infrastructure-level view and an application-level view.

Map monitoring to organizational structure

As those familiar with Conway's Law know, the design of a system tends to mirror the organizational structure of the team that builds it. The drive to create faster, more agile software pushes teams to rethink how development is organized and managed. To benefit from this new software architecture (microservices), teams therefore need to map the microservices onto their own structure: smaller, more loosely coupled teams that can choose their own direction as long as it meets the overall needs. Within each team, there is greater control over the choice of development language, bug handling, and even job responsibilities.