Five-minute technology talk | Distributed UUID generation strategy and application scenario analysis

2023.07.15

Five-minute technology talk | Distributed UUID generation strategy and application scenario analysis


UUID is a concept proposed by the International Organization for Standardization (ISO). UUID is used to identify the attribute type and is regarded as a unique identifier in all space and time. This article will introduce the composition method, current version, generation strategy, and application cases of UUID.

Part 01

What are UUIDs 

The full name of UUID is Universal Unique Identifier, which is a string of 128-bit digital codes used to uniquely identify network objects or events. Due to its unique generation mechanism and usage scenarios, UUID can ensure global uniqueness and avoid duplication. UUID is widely used in various scenarios that require unique identification, such as database primary keys, system instance IDs, and identification of Bluetooth configuration files and objects with short life cycles.

UUID is a term similar to GUID. GUID originally introduced by Microsoft is actually a variant of UUID. The two terms are defined as synonyms in the RFC 4122 specification. Subsequently, the Open Software Foundation (OSF) standardized UUID, making it an important part of a distributed computing network, and all derived UUID versions follow the RFC 4122 specification.

Part 02

The composition of UUID 

But if the UUID is usually generated by a specific algorithm, such as based on a timestamp or a network address, etc. UUID ensures its uniqueness through a specific combination arrangement, consisting of 32 hexadecimal numbers (including numbers 0 to 9 and letters A to F) and 4 hyphens. The number of characters per hyphen is 8-4-4-4-12, where the last 4 or N bits indicate the format and encoding:

picture

UUIDs can also be represented in decimal or binary format:

picture

Traditional UUIDs are roughly divided into 3 variants:

➤ Variant 0: Reserved for compatibility with the Apollo network computing system that was not outdated in the 1980s, its structure is similar to the currently used version 1.

➤ Variation 1: The variant that is mainly used at present is defined as RFC 4122/DCE 1.1 UUID or Leach-Salz UUID in the Internet engineering document specification. For example, Microsoft's GUID is the variant 1 of UUID.

➤ Variant 2: Reserved for compatibility with Microsoft's subsequent development. Although the existing Microsoft GUID is Variant 1 of UUID, the early Windows platform uses Variant 2. Variant 1 and Variant 2 differ in the number of bits at the N-bit position, e.g. Variant 1 uses 2 bits, while Variant 2 uses 3 bits.

Part 03

The current version of the UUID

The existing UUID is mainly derived based on variant 1, and consists of five different versions. The UUID generation methods of different versions are different, including:

➢ Version1: UUID generated based on timestamp and nodes. It uses the current time and your computer's MAC address to generate a unique identifier. This version of UUID guarantees global uniqueness and temporal ordering, but may have security and privacy concerns in some cases.

➢ Version2: UUID generated based on DCE Security Identifier (DCE Security Identifier). This version of UUID encodes the identifier's role and permission information into the UUID, enabling it to represent users, groups, and ACLs (Access Control Lists). However, this version of UUID is uncommon and not widely supported.

➢ Version3: UUID generated based on name and namespace. It processes names and namespaces using the MD5 hash function to generate unique identifiers. This version of UUID can be used to identify objects in a namespace, such as URLs, domain names, etc.

➢ Version4: UUID based on random number generation. This version of UUID is generated using a random number generation algorithm, so it is highly random and unique. It is currently the most commonly used UUID version and is widely used in various fields.

➢ Version5: UUID generated based on name and namespace. It is similar to Version 3, but uses the SHA-1 hash function, which provides a stronger hash algorithm. This version of UUID can also be used to identify objects within a namespace.

Part 04

Generation strategy for existing UIDs 

4.1 Mysql generated ID

Mysql uses the primary key auto_increment method to generate IDs, and the step size between IDs is fixed, but the step size can be customized. This method is simple and easy to use, and can guarantee the increment and uniqueness of the ID, but there are problems of single point of failure and data consistency, and there are certain challenges in terms of scalability.

4.2 MongoDB generates ID

The UUID generated by MongoDB is composed of 12-byte hexadecimal numbers, specifically: 4 bytes-time stamp in seconds, 3 bytes-machine identifier, 2 bytes-process ID,3 bytes -- counter (starts with a random value). Shorter than traditional UUIDs, but longer than MYSQL autoincrement fields (64-bit Bigint values).

4.3 Redis generates ID

Redis usually uses the atomic operation INCR or INCRBY to implement ID generation. If it is a Redis cluster, you can set the ID initial value and customize the step size of each node ID. Since Redis is a single-threaded model, the uniqueness of the generated ID can be guaranteed.

4.4 Zookeeper generates ID

Zookeeper mainly generates an ID through the ZNODE data version. This ID is usually composed of 32-bit or 64-bit strings. The client can use this ID as a UID. Due to the need to rely heavily on Zookeeper, it is rarely considered to use this method in the case of many high-concurrency scenarios.

Part 05

Application case analysis of UUID 

5.1 Twitter generates UUID

Twitter uses the snowflake algorithm as a professional service to uniformly generate 64-bit unique identifiers for object identification in distributed systems, such as tweets, direct messages, lists, etc. These IDs are unique unsigned 64-bit integers based on time. The main composition of the complete ID definition is as follows:

  • 41 bits -- Timestamp in milliseconds (69 years relative to any custom epoch usually works)
  • 10 digits -- Configured machine/node/shard ID
  • 12 bits -- serial number (local counter for each machine, set to 0 after every 4096 values)
  • 1 bit--an extra reserved bit, generally set to 0 to ensure that the total is a positive value

picturepicture

UIDs constructed in this way not only improve usability, because the timestamp is used as the first part, but also can be sorted according to time. By default, the snowflake algorithm generates a 64-bit unsigned long integer, which is an ID of length 19. Sometimes it may not be so long in a specific project, and the algorithm can also be modified according to its own needs.

5.2 Baidu generates UUID

Baidu improved the generation of UID based on the snowflake algorithm, adjusted the bits of the UID, and modified the timestamp part to 28 bits, which is used to represent the time difference between the current time and the time generated by the initial online generation (in seconds), where the initial time can be manually configured . Use 22 bits to represent the working node ID, generate a distributed ID when the instance starts, and write the self-increasing sequence value obtained in the database. Use a 13-digit serial number to solve problems such as time callback. If the current time is the same as the last time, the sequence will auto-increment, and if it exceeds the threshold, it will spin.

picturepicture

5.3 Meituan generates UUID

Based on the improvement of the snowflake algorithm, Meituan uses the method of "1+41+5+5+12" to form a UID. On the basis of the original snowflake algorithm, it uses 5 bits to represent the machine ID, 5 bits to represent the computer room ID, and 12 bits for the serial number Represents self-increment. Due to the large cluster, the machine ID is configured based on the characteristics of Zookeeper components. For the solution to the time callback problem, set a threshold of 5 milliseconds, wait for the callback to regenerate a new UID after the callback time is less than 5 milliseconds, and throw an exception if it exceeds the threshold.

picturepicture

Most of the UID generation schemes of other large Internet companies are improved based on Twitter’s snowflake algorithm. Among them, Didi uses "time stamp + starting number + license plate number" to generate the corresponding UID, and Taobao orders use "time stamp + user ID "Generation, Didi loads the number segment into the memory based on Meituan's plan, and supports multiple master node modes at the same time. The UID generation of WeChat is mainly bound to the user serial number, and adopts the implementation method of step-by-step persistence and shared storage of semicolon segments.

Part 06

Application Practice of Provincial Broadband TV Membership Management Platform 

The provincial broadband TV member management platform mainly manages the member users of broadband TV in various provinces. It is a large-scale distributed platform, and the servers are distributed across provinces. There are a large number of servers, and there is a problem of clock rollback. Therefore, this platform improves on the basis of the snowflake algorithm, adds the concept of timeline, can support multiple timelines in parallel at the same time, and solves the clock rollback problem well. The specific plan is as follows:

picture

Adjust 2 bits to represent the timeline, and support up to 4 timelines. Set a unified initial time for all timelines, designate one timeline as the unified timeline, and generate a corresponding ID based on the timeline to advance the time progress. When the clock callback occurs on the machine, when the callback time interval is greater than the set threshold, switch to another timeline to continue generating IDs.