What Are the "Three Measurements" of Distributed Systems? Do You Know?

In distributed systems, the key metrics are performance, resource usage, availability, and scalability. Performance directly affects the system's responsiveness and throughput; resource usage measures how effectively the system uses its computing and storage resources; availability ensures that the system keeps serving users even when failures occur; and scalability determines how well the system adapts as load grows. Together, these metrics determine the overall efficiency and reliability of a distributed system.

Metrics for Distributed Systems

The origin of distributed technology shows why these metrics matter: distributed systems emerged to solve the performance bottlenecks, resource shortages, availability problems, and scalability limits that a single computer faces when processing complex, large-scale data and tasks, by combining cheap commodity machines. In other words, the goal of a distributed system is to use more machines to process more data and more complex tasks. Performance, resources, availability, and scalability have therefore become the important indicators of distributed systems; it can be said that they constitute its core elements.

Performance

Performance indicators are mainly used to measure the ability of a system to handle various tasks. Whether it is a distributed system or a stand-alone system, performance requirements are an important consideration. Since different systems and services have different goals, the performance indicators of concern will also be different, and may even contradict each other. Common performance indicators include throughput, response time, and completion time.

Throughput refers to the number of tasks a system can process within a given period of time and is a direct reflection of system performance. Common throughput indicators include the following (a small sketch computing them follows the list):

  • QPS (Queries Per Second): the number of queries the system can process per second.
  • TPS (Transactions Per Second): the number of transactions the system can process per second.
  • BPS (Bits Per Second): the number of bits the system can transfer or process per second, used to measure data throughput.
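
As a rough illustration rather than a benchmarking tool, the sketch below converts counters collected over a measurement window into QPS, TPS, and BPS. The counter names and the 60-second window are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class WindowStats:
    """Counters collected over one measurement window (hypothetical names)."""
    queries: int            # number of queries served
    transactions: int       # number of transactions committed
    bits_transferred: int   # total bits moved in or out
    window_seconds: float   # length of the measurement window

def throughput(stats: WindowStats) -> dict:
    """Convert raw counters into per-second throughput figures."""
    w = stats.window_seconds
    return {
        "QPS": stats.queries / w,
        "TPS": stats.transactions / w,
        "BPS": stats.bits_transferred / w,
    }

# Example: 120,000 queries, 30,000 transactions, 8 Gbit in a 60-second window.
print(throughput(WindowStats(120_000, 30_000, 8 * 10**9, 60.0)))
# -> {'QPS': 2000.0, 'TPS': 500.0, 'BPS': 133333333.33...}
```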

Response time refers to the time it takes for a system to respond to a request or input. It directly affects the user experience and is particularly important for latency-sensitive businesses.
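
Because averages hide outliers, latency-sensitive services usually also track tail percentiles of response time. Below is a minimal sketch, assuming response times have already been collected in milliseconds; the nearest-rank percentile method and the sample values are illustrative choices.

```python
import statistics

def latency_summary(response_times_ms: list[float]) -> dict:
    """Summarize response times: mean plus p50/p95/p99 tail latencies."""
    ordered = sorted(response_times_ms)

    def pct(p: float) -> float:
        # Nearest-rank percentile: the value below which p% of samples fall.
        idx = max(0, round(p / 100 * len(ordered)) - 1)
        return ordered[idx]

    return {
        "mean": statistics.mean(ordered),
        "p50": pct(50),
        "p95": pct(95),
        "p99": pct(99),
    }

# 980 fast requests plus 20 slow ones: the mean rises only to 19.8 ms,
# but the p99 jumps to 500 ms, which is what users on the tail experience.
samples = [10.0] * 980 + [500.0] * 20
print(latency_summary(samples))
```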

Completion time refers to the total time the system needs to actually finish a request or job. One of the main purposes of the task-parallel (or task-distribution) mode is to shorten the completion time of the entire job; this matters especially when processing massive data or large-scale tasks, where users are highly sensitive to how long the whole job takes.
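
The sketch below is an idealized back-of-the-envelope model of how distributing work shortens completion time. The even split of tasks and the fixed coordination overhead are simplifying assumptions; real jobs also suffer from stragglers and data skew.

```python
import math

def estimated_completion_time(total_tasks: int, seconds_per_task: float,
                              workers: int, coordination_overhead_s: float = 0.0) -> float:
    """Idealized model: tasks split evenly across workers, plus fixed overhead."""
    return math.ceil(total_tasks / workers) * seconds_per_task + coordination_overhead_s

# 10,000 tasks at 0.1 s each: ~1,000 s on one machine, ~100 s across 10 machines.
print(estimated_completion_time(10_000, 0.1, 1))    # 1000.0
print(estimated_completion_time(10_000, 0.1, 10))   # 100.0
```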

Resource Usage

Resource usage refers to the hardware resources the system needs to run, such as CPU, memory, and disk. The resources consumed when the system carries no load are its no-load (idle) resource usage, which reflects the overhead of the system itself; for the same functionality, the lower the no-load usage, the better the design and the more attractive the system is to users. The resources consumed when the system runs at full capacity are its full-load resource usage, which shows how many resources the system needs at peak and reflects its processing power. Under the same hardware configuration, a system that serves more workloads while consuming fewer resources is the better-designed one.
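
One common way to sample these figures on a single node is the third-party psutil library; the sketch below takes a one-off snapshot, and comparing an idle snapshot with one taken under a benchmark approximates the no-load and full-load usage described above. The choice of psutil and of which metrics to sample is an assumption for illustration.

```python
import psutil  # third-party: pip install psutil

def sample_resource_usage() -> dict:
    """Take a snapshot of CPU, memory, and disk usage on this node."""
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),      # averaged over 1 s
        "memory_percent": psutil.virtual_memory().percent,  # RAM in use
        "disk_percent": psutil.disk_usage("/").percent,     # root filesystem
    }

# Sample once while the system is idle (no-load usage) and once under load
# (full-load usage), then compare the two snapshots.
print(sample_resource_usage())
```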

Availability

Availability usually refers to the ability of a system to keep providing correct service in the face of various anomalies. It is one of the important indicators of a distributed system and reflects its robustness and fault tolerance. Availability can be measured as the proportion of total running time during which the system is actually serving. For example, if a website's total running time is 24 hours and it is unavailable due to failures for 4 hours during that period, then its unavailability is 4/24 ≈ 16.7% and its availability is (24-4)/24 ≈ 83.3%. Availability can also be measured as the proportion of requests to a given function that succeed. For example, if 10 out of 1,000 website requests fail, the availability is 990/1000 = 99%.
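
The two ways of measuring availability can be written out directly; the numbers below simply repeat the examples from the paragraph above.

```python
def availability_from_uptime(total_hours: float, downtime_hours: float) -> float:
    """Availability as the share of time the system was actually serving."""
    return (total_hours - downtime_hours) / total_hours

def availability_from_requests(total_requests: int, failed_requests: int) -> float:
    """Availability as the share of requests that succeeded."""
    return (total_requests - failed_requests) / total_requests

print(availability_from_uptime(24, 4))        # 0.8333... -> about 83.3% available
print(availability_from_requests(1000, 10))   # 0.99      -> 99% available
```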

Scalability

Scalability refers to the ability of a distributed system to improve its performance (such as throughput, response time, and completion time), storage capacity, and computing power by adding machines to the cluster. This is a major advantage of distributed systems: their original design intent was to use the power of a multi-machine cluster to solve problems a single machine cannot handle. The number of machines needed to complete a specific task, that is, the cluster size, depends on the performance of a single machine and on the task's requirements.

As business needs increase, in addition to vertical expansion by upgrading the performance of a single machine, another way is to expand horizontally by increasing the number of machines. Vertical expansion refers to enhancing the hardware capabilities of a single machine, such as adding CPU or memory, while horizontal expansion refers to increasing the number of computers. An ideal distributed system pursues "linear scalability", that is, a certain indicator can grow linearly with the increase in the number of machines in the cluster.

A common indicator for measuring system scalability is speedup, which is the performance improvement after system expansion relative to that before expansion. If the goal of expansion is to increase the throughput of the system, it can be measured by the ratio of the throughput after expansion to the throughput before expansion. If the goal is to shorten the completion time, it can be evaluated by the ratio of the completion time before expansion to the completion time after expansion.
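
A minimal sketch of the two speedup ratios described above; the cluster sizes and measurements are illustrative numbers, not benchmark results.

```python
def throughput_speedup(throughput_after: float, throughput_before: float) -> float:
    """Speedup when the goal of scaling out is higher throughput."""
    return throughput_after / throughput_before

def completion_time_speedup(time_before_s: float, time_after_s: float) -> float:
    """Speedup when the goal of scaling out is a shorter completion time."""
    return time_before_s / time_after_s

# Tripling the cluster raises throughput from 2,000 to 5,400 QPS: 2.7x, i.e. sub-linear.
print(throughput_speedup(5_400, 2_000))        # 2.7
# The same job drops from 1,000 s to 400 s: a 2.5x speedup.
print(completion_time_speedup(1_000, 400))     # 2.5
```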