5 Minute Tech Talk | A brief introduction to the WebSocket protocol - RFC 6455

01 Introduction

WebSocket is a network communication protocol that provides full-duplex communication over a single TCP connection. It was born in 2009, was standardized by the Internet Engineering Task Force (IETF) in 2011 as RFC 6455, a Standards Track document, and was supplemented in 2016 by RFC 7936. The WebSocket API is standardized by the W3C.

Image

The WebSocket protocol was designed to supersede bidirectional communication techniques that use HTTP as a transport, such as polling and long polling; as RFC 6202 notes, HTTP was not originally intended for bidirectional data communication. WebSocket does not abandon HTTP entirely: it achieves bidirectional communication within the existing HTTP infrastructure by starting each connection as an HTTP request. As stated in RFC 6455, WebSocket's design philosophy is to be a framework with minimal constraints, the only constraints being that the protocol is frame-based rather than stream-based and that it supports both Unicode text and binary frames.

02 Handshake

The WebSocket protocol is divided into three parts: the opening handshake, data transfer, and the closing handshake, as shown in the following figure.

Image

2.1 Opening Handshake - Client

To remain compatible with HTTP server-side software and intermediaries, the client's opening handshake (including one sent through a proxy or over a TLS-encrypted tunnel) is a valid HTTP Upgrade request as defined in RFC 2616. The client's handshake request header fields are shown in the following figure. Once the client has sent its opening handshake, it must wait for the server's response before sending any further data.

Image

- Request URI

Format: ws-URI = "ws:" "//" host [ ":" port ] path [ "?" query ] or wss-URI = "wss:" "//" host [ ":" port ] path [ "?" query ]. Any invalid value causes the connection to fail.

- Request line

The method must be GET and the HTTP version must be at least 1.1.

- Upgrade

The value must be "websocket", an ASCII value compared case-insensitively.

- Connection

The value must contain "Upgrade", an ASCII value compared case-insensitively.

- Sec-WebSocket-Key

The base64 encoding of a 16-byte value (nonce) randomly generated by the client.

- Origin

The origin of the page that opened the connection. Required for browser clients, optional for non-browser clients.

- Sec-WebSocket-Protocol

One or more comma-separated subprotocols supported by the client, in order of priority

- Sec-WebSocket-Version

The protocol version the client intends to use, which must be 13. Although drafts -09, -10, -11, and -12 of the specification were published, the values 9, 10, 11, and 12 were not used as valid values for this field.

- Sec-WebSocket-Extensions

The protocol extensions the client would like to use. The HyBi Working Group has worked on multiplexing and compression extensions: the multiplexing extension lets multiple WebSocket connections share one underlying TCP connection, and the compression extension adds compression to the WebSocket protocol, for example x-webkit-deflate-frame.
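The header fields listed above can be assembled into a request with a few lines of code. The following is a minimal sketch, not a complete client: the function name build_handshake_request and its parameters are hypothetical, no socket is opened, and IPv6 hosts and extensions are left out for brevity.

```python
import base64
import os
from urllib.parse import urlsplit


def build_handshake_request(uri, origin=None, subprotocols=()):
    """Return (request_bytes, key) for a ws:// or wss:// URI (illustrative helper)."""
    parts = urlsplit(uri)
    if parts.scheme not in ("ws", "wss"):
        raise ValueError("URI scheme must be ws or wss")
    if not parts.hostname:
        raise ValueError("URI must contain a host")
    if parts.fragment:
        raise ValueError("fragments are not allowed in WebSocket URIs")

    # Default ports: 80 for ws, 443 for wss.
    default_port = 443 if parts.scheme == "wss" else 80
    port = parts.port or default_port
    host = parts.hostname if port == default_port else f"{parts.hostname}:{port}"

    # The request target is the path plus the optional query.
    resource = parts.path or "/"
    if parts.query:
        resource += "?" + parts.query

    # Sec-WebSocket-Key: base64 of a random 16-byte nonce.
    key = base64.b64encode(os.urandom(16)).decode("ascii")

    lines = [
        f"GET {resource} HTTP/1.1",
        f"Host: {host}",
        "Upgrade: websocket",
        "Connection: Upgrade",
        f"Sec-WebSocket-Key: {key}",
        "Sec-WebSocket-Version: 13",
    ]
    if origin:
        lines.append(f"Origin: {origin}")
    if subprotocols:
        # Subprotocols are listed in order of preference.
        lines.append("Sec-WebSocket-Protocol: " + ", ".join(subprotocols))
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii"), key
```

The key is returned alongside the request so that the caller can later check the server's Sec-WebSocket-Accept value against it.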

2.2 Opening Handshake - Server

When a client sends an opening handshake, the server must reply with its own handshake response to complete it, as shown in the following figure.

Image

- Status line

HTTP/1.1 101 Switching Protocols indicates that the server accepts the connection. If the server does not want to proceed with the client's handshake, it can instead return an HTTP response with an appropriate error code, such as 401.

- Upgrade

The value must be "websocket"

- Connection

The value must contain "Upgrade"

- Sec-WebSocket-Accept

This value is generated when the server accepts the client's connection. First, concatenate the Sec-WebSocket-Key value from the client's request header with the globally unique identifier (GUID) "258EAFA5-E914-47DA-95CA-C5AB0DC85B11", which RFC 6455 specifies in the string format defined by RFC 4122. Then take the SHA-1 hash of the result and base64-encode it to obtain the value.

- Sec-WebSocket-Protocol

The subprotocol the server has chosen from the list in the client's Sec-WebSocket-Protocol header. If the server supports none of them, it omits this header from its response.

- Sec-WebSocket-Extensions

The protocol extensions the server has agreed to use, chosen from those offered by the client.
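The Sec-WebSocket-Accept computation described above is short enough to show in full. The sketch below assumes the key has already been extracted from the request headers; the function names compute_accept and build_handshake_response are hypothetical, and extension negotiation is omitted.

```python
import base64
import hashlib

# GUID fixed by RFC 6455 for the accept computation.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def compute_accept(sec_websocket_key):
    """Concatenate key and GUID, SHA-1 hash the result, then base64-encode it."""
    digest = hashlib.sha1((sec_websocket_key.strip() + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


def build_handshake_response(sec_websocket_key, subprotocol=None):
    """Build a 101 response; the subprotocol header is sent only if one was chosen."""
    lines = [
        "HTTP/1.1 101 Switching Protocols",
        "Upgrade: websocket",
        "Connection: Upgrade",
        f"Sec-WebSocket-Accept: {compute_accept(sec_websocket_key)}",
    ]
    if subprotocol:
        lines.append(f"Sec-WebSocket-Protocol: {subprotocol}")
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


# Sample key and expected accept value taken from RFC 6455.
print(compute_accept("dGhlIHNhbXBsZSBub25jZQ=="))  # s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```

A client can run the same computation on the key it sent and compare the result with the header it receives, which confirms that the server actually understood the WebSocket handshake.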

2.3 Closing Handshake

Either the client or the server can send a control frame carrying a specified control sequence (a Close control frame) to start the closing handshake. When an endpoint receives a Close frame without having sent one itself, it sends a Close frame in response and then closes the connection. Because the TCP closing handshake (FIN/ACK) is not always reliable end to end, for example in the presence of intercepting proxies, this closing handshake is intended to complement it.

03 Data Transfer

Once the opening handshake between client and server succeeds, both sides can start transferring data. The channel is bidirectional: each side can send data independently and at any time, in units that RFC 6455 calls "messages". A message consists of one or more frames (which do not necessarily correspond to any particular network-layer framing), and the WebSocket frame format is shown in the following figure.

Image

3.1 Frame Structure

- FIN

1 bit, indicating whether this is the final fragment of a message.

- RSV1, RSV2, RSV3

1 bit each. They must be 0 unless an extension that defines a meaning for non-zero values has been negotiated.

- Opcode

4 bits, defining how the "Payload data" is interpreted.

  • 0 (decimal): Continuation frame
  • 1: Text frame
  • 2: Binary frame
  • 3-7: Reserved for further non-control frames
  • 8: Connection close frame
  • 9: Ping (heartbeat) frame
  • 10: Pong (heartbeat) frame
  • 11-15: Reserved for further control frames

- MASK

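1 bit, indicating whether the "Payload data" is masked: 1 for yes, 0 for no. Frames sent from the client to the server must be masked; frames sent from the server to the client must not be.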

- Payload length

7 bits, 7 + 16 bits, or 7 + 64 bits, indicating the length of the payload data in bytes. If the 7-bit value is 0 to 125, it is the payload length itself. If it is 126, the following 16 bits, read as an unsigned integer, are the payload length. If it is 127, the following 64 bits, read as an unsigned integer, are the payload length. Multi-byte lengths are in network byte order.

- Masking-key

32 bits, present only when MASK is 1; it holds the masking key chosen by the client. To prevent proxy cache poisoning attacks, RFC 6455 requires that the key come from a strong source of entropy and be unpredictable. Masking works as follows: for the i-th byte of the payload data, let j = i mod 4; the masked i-th byte is the bitwise XOR of the original i-th byte and the j-th byte of the masking key. Applying the same operation again recovers the original data.

- Payload data

Payload data consists of two parts: extension data, whose use is negotiated during the opening handshake, followed by application data.
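The fields above can be tied together in a small encoder and decoder. This is a minimal sketch under a few assumptions: the function names are hypothetical, the RSV bits are always 0, no extension data is present, and decode_frame expects a buffer that already holds one complete frame.

```python
import os
import struct


def apply_mask(data, key):
    """XOR byte i of data with byte (i mod 4) of the 4-byte masking key."""
    return bytes(b ^ key[i % 4] for i, b in enumerate(data))


def encode_frame(opcode, payload, fin=True, mask=False):
    """Serialize one frame; clients would pass mask=True, servers mask=False."""
    first = (0x80 if fin else 0x00) | (opcode & 0x0F)
    mask_bit = 0x80 if mask else 0x00
    n = len(payload)
    if n <= 125:
        header = struct.pack("!BB", first, mask_bit | n)
    elif n <= 0xFFFF:
        header = struct.pack("!BBH", first, mask_bit | 126, n)  # 16-bit length
    else:
        header = struct.pack("!BBQ", first, mask_bit | 127, n)  # 64-bit length
    if not mask:
        return header + payload
    key = os.urandom(4)  # unpredictable masking key from the OS random source
    return header + key + apply_mask(payload, key)


def decode_frame(buf):
    """Parse one complete frame; return (fin, opcode, payload, bytes_consumed)."""
    first, second = buf[0], buf[1]
    fin = bool(first & 0x80)
    opcode = first & 0x0F
    masked = bool(second & 0x80)
    n = second & 0x7F
    pos = 2
    if n == 126:
        (n,) = struct.unpack_from("!H", buf, pos)
        pos += 2
    elif n == 127:
        (n,) = struct.unpack_from("!Q", buf, pos)
        pos += 8
    key = None
    if masked:
        key = buf[pos:pos + 4]
        pos += 4
    payload = buf[pos:pos + n]
    if masked:
        payload = apply_mask(payload, key)
    return fin, opcode, payload, pos + n
```

For example, encode_frame(0x1, b"Hello") produces the seven bytes 81 05 48 65 6c 6c 6f, which matches the unmasked single-frame text example given in RFC 6455.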

3.2 Control Frames

Control frames are identified by their opcode. The control frame opcodes currently defined by the protocol are 0x8 (Close), 0x9 (Ping), and 0xA (Pong). A control frame must have a payload length of 125 bytes or less and must not be fragmented. For the Close control frame, the first 2 bytes of the payload hold the status code as an unsigned integer in network byte order, and the remaining bytes hold the reason for closing encoded as UTF-8, as shown in the following figure.

Image
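The Close frame payload layout can be sketched as a pair of helpers. The names build_close_payload and parse_close_payload are hypothetical; the status code 1000 used as the default is the "normal closure" code from RFC 6455.

```python
import struct


def build_close_payload(code=1000, reason=""):
    """2-byte status code in network byte order, then the UTF-8 reason."""
    data = struct.pack("!H", code) + reason.encode("utf-8")
    if len(data) > 125:
        raise ValueError("control frame payload must not exceed 125 bytes")
    return data


def parse_close_payload(payload):
    """Return (code, reason); an empty payload carries no status code."""
    if len(payload) == 0:
        return None, ""
    if len(payload) == 1:
        raise ValueError("a Close payload cannot be a single byte")
    (code,) = struct.unpack("!H", payload[:2])
    return code, payload[2:].decode("utf-8")
```

Because the whole payload is limited to 125 bytes, the reason text has at most 123 bytes available after the status code.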

3.3 Message Fragmentation

Message fragmentation means sending one "message" as multiple data frames. It allows a message of unknown size to be sent without buffering the whole message first. Combined with a multiplexing extension, fragmentation can also split a large message into smaller pieces so that it does not monopolize the shared output channel.

In the protocol, the first frame of a fragmented message has FIN set to 0 and a non-zero opcode that identifies the message type; each middle frame has FIN set to 0 and opcode 0; and the final frame has FIN set to 1 and opcode 0. The protocol requires that the fragments of a message be delivered to the other end in the order in which they were sent.
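The FIN and opcode rules for fragments can be illustrated with a small splitter and reassembler. This is only a sketch: fragment_message and reassemble are hypothetical names, frames are modelled as (fin, opcode, chunk) tuples rather than wire bytes, and control frames interleaved between fragments are not handled.

```python
def fragment_message(payload, opcode, max_fragment_size):
    """Yield (fin, opcode, chunk) tuples for one message."""
    if max_fragment_size <= 0:
        raise ValueError("max_fragment_size must be positive")
    chunks = [payload[i:i + max_fragment_size]
              for i in range(0, len(payload), max_fragment_size)] or [b""]
    last = len(chunks) - 1
    for idx, chunk in enumerate(chunks):
        # First frame carries the real opcode; later frames use 0 (continuation).
        # Only the last frame has FIN set.
        yield (idx == last, opcode if idx == 0 else 0x0, chunk)


def reassemble(frames):
    """Join in-order fragments of one message; return (opcode, payload)."""
    message_opcode = None
    parts = []
    for fin, opcode, chunk in frames:
        if message_opcode is None:
            if opcode == 0x0:
                raise ValueError("continuation frame without a starting frame")
            message_opcode = opcode
        elif opcode != 0x0:
            raise ValueError("expected a continuation frame")
        parts.append(chunk)
        if fin:
            return message_opcode, b"".join(parts)
    raise ValueError("message ended before a frame with FIN set")


frames = list(fragment_message(b"abcdefgh", 0x2, 3))
print(frames)  # [(False, 2, b'abc'), (False, 0, b'def'), (True, 0, b'gh')]
```

A message that fits in a single frame comes out as one tuple with FIN set and the original opcode, which is the unfragmented case.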

04 Summary

WebSocket is designed on top of TCP and is frame-based, so applications do not need to delimit messages or track their length themselves. It can also be combined with HTTP/2 multiplexing through an extension mechanism (RFC 8441 describes running WebSocket over an HTTP/2 stream) to make fuller use of bandwidth. Developers only need to handle message fragments in order in their server-side and client-side code.