Let's talk about TCP tracepoints

Let's talk about TCP tracepoints

I can imagine how useful these TCP tracepoints would be, since I designed and used a similar tracepoint many years ago: the DTrace TCP provider I demonstrated at CEC2006. I originally split TCP state changes into probes for each state, but when it was merged it became a single tcp:::state change probe like we do in Linux now via socks tracepoint .

TCP tracepoints have arrived for Linux! Linux 4.15 added five of them, and 4.16 (not yet fully released) added two more (tcp: tcp_probe and sock: inet_sock_set_state - a socket tracepoint that can be used for TCP analysis). We now have:

# perf list 'tcp:*' 'sock:inet*'

List of pre-defined events (to be used in -e):

  tcp:tcp_destroy_sock                               [Tracepoint event]
  tcp:tcp_probe                                      [Tracepoint event]
  tcp:tcp_receive_reset                              [Tracepoint event]
  tcp:tcp_retransmit_skb                             [Tracepoint event]
  tcp:tcp_retransmit_synack                          [Tracepoint event]
  tcp:tcp_send_reset                                 [Tracepoint event]

  sock:inet_sock_set_state                           [Tracepoint event]
  • 1.
  • 2.
  • 3.
  • 4.
  • 5.
  • 6.
  • 7.
  • 8.
  • 9.
  • 10.
  • 11.
  • 12.

This includes a multipurpose :sock:inet_sock_set_state. It can be used to track when the kernel changes the state of a TCP session, for example from TCP_SYN_SENT to TCP_ESTABLISHED. An example is my tcplife tool in the open source bcc collection:

# tcplife
PID   COMM       LADDR           LPORT RADDR           RPORT TX_KB RX_KB MS
22597 recordProg 127.0.0.1       46644 127.0.0.1       28527     0     0 0.23
3277  redis-serv 127.0.0.1       28527 127.0.0.1       46644     0     0 0.28
22598 curl       100.66.3.172    61620 52.205.89.26    80        0     1 91.79
22604 curl       100.66.3.172    44400 52.204.43.121   80        0     1 121.38
22624 recordProg 127.0.0.1       46648 127.0.0.1       28527     0     0 0.22
3277  redis-serv 127.0.0.1       28527 127.0.0.1       46648     0     0 0.27
[...]
  • 1.
  • 2.
  • 3.
  • 4.
  • 5.
  • 6.
  • 7.
  • 8.
  • 9.

I wrote tcplife before this tracepoint existed, so I had to use kprobes (kernel dynamic tracing) of the tcp_set_state() kernel function. This works, but it relies on various kernel implementation details which may change from one kernel version to the next. In order to keep tcplife working properly, different code needs to be included every time the kernel changes, which becomes difficult to maintain and enhance. Imagine having to test changes on several different kernel versions, since tcplife has special code for each!

Tracepoints are considered a "stable API", so their details should not change from one kernel version to the next, making programs that use them easier to maintain. I say "shouldn't" on purpose because I think these are "best effort" rather than "set in stone". Convincing kernel maintainers to accept new tracepoints will be much harder (for good reason) if they are considered immutable. For example: tcp:tcp_set_state was added in 4.15, then sock:inet_sock_set_state was added in 4.16. Since socks is a superset, tcp was disabled in 4.16 and will be removed. We try to avoid changing trackpoints like this, but in this case it's short-lived and removed before anyone can use it.

Anyway, tcplife isn't a good example of using tracepoints, since it's out of tracepoint API in three places (tx and rx bytes, and best-effort process context on tracepoints), so it might still Needs some maintenance. But this is a big improvement over the kprobes version, and other tools can only stick to the tracepoint API.

Another way to show sock:inet_sock_set_state is to use Sasha Goldshtein's bcc trace tool to compare it to kprobes on tcp_set_state(). The first command uses kprobes, the second uses tracepoints:

# trace -t -I net/sock.h 'p::tcp_set_state(struct sock *sk) "%llx: %d -> %d", sk, sk->sk_state, arg2'
TIME     PID     TID     COMM            FUNC             -
2.583525 17320   17320   curl            tcp_set_state    ffff9fd7db588000: 7 -> 2
2.584449 0       0       swapper/5       tcp_set_state    ffff9fd7db588000: 2 -> 1
2.623158 17320   17320   curl            tcp_set_state    ffff9fd7db588000: 1 -> 4
2.623540 0       0       swapper/5       tcp_set_state    ffff9fd7db588000: 4 -> 5
2.623552 0       0       swapper/5       tcp_set_state    ffff9fd7db588000: 5 -> 7
^C
# trace -t 't:sock:inet_sock_set_state "%llx: %d -> %d", args->skaddr, args->oldstate, args->newstate'
TIME     PID     TID     COMM            FUNC             -
1.690191 17308   17308   curl            inet_sock_set_state ffff9fd7db589800: 7 -> 2
1.690798 0       0       swapper/24      inet_sock_set_state ffff9fd7db589800: 2 -> 1
1.727750 17308   17308   curl            inet_sock_set_state ffff9fd7db589800: 1 -> 4
1.728041 0       0       swapper/24      inet_sock_set_state ffff9fd7db589800: 4 -> 5
1.728063 0       0       swapper/24      inet_sock_set_state ffff9fd7db589800: 5 -> 7
^C
  • 1.
  • 2.
  • 3.
  • 4.
  • 5.
  • 6.
  • 7.
  • 8.
  • 9.
  • 10.
  • 11.
  • 12.
  • 13.
  • 14.
  • 15.
  • 16.

Both show the same output. for reference:

1: TCP_ESTABLISHED

2: TCP_SYN_SENT

3: TCP_SYN_RECV

4: TCP_FIN_WAIT1 5: TCP_FIN_WAIT2 6: TCP_TIME_WAIT

7: TCP_CLOSE

8: TCP_CLOSE_WAIT

I know, I know, I should have added it as a lookup hash, and then... After a while, here's a new tool I just contributed to bcc - tcpstate, which does the translation, and shows every Duration of states:

# tcpstates
SKADDR           C-PID C-COMM     LADDR           LPORT RADDR         RPORT OLDSTATE    -> NEWSTATE    MS
ffff9fd7e8192000 22384 curl       100.66.100.185  0     52.33.159.26  80    CLOSE       -> SYN_SENT    0.000
ffff9fd7e8192000 0     swapper/5  100.66.100.185  63446 52.33.159.26  80    SYN_SENT    -> ESTABLISHED 1.373
ffff9fd7e8192000 22384 curl       100.66.100.185  63446 52.33.159.26  80    ESTABLISHED -> FIN_WAIT1   176.042
ffff9fd7e8192000 0     swapper/5  100.66.100.185  63446 52.33.159.26  80    FIN_WAIT1   -> FIN_WAIT2   0.536
ffff9fd7e8192000 0     swapper/5  100.66.100.185  63446 52.33.159.26  80    FIN_WAIT2   -> CLOSE       0.006
^C
  • 1.
  • 2.
  • 3.
  • 4.
  • 5.
  • 6.
  • 7.
  • 8.

I demonstrated this on Linux 4.16 after Yafang Shao wrote an enhancement to show all state transitions, not just those implemented by the kernel. Here's what it looks like on 4.15:

 trace -I net/sock.h -t 'p::tcp_set_state(struct sock *sk) "%llx: %d -> %d", sk, sk->sk_state, arg2'
TIME     PID    TID    COMM         FUNC             -
3.275865 29039  29039  curl         tcp_set_state    ffff8803a8213800: 7 -> 2
3.277447 0      0      swapper/1    tcp_set_state    ffff8803a8213800: 2 -> 1
3.786203 29039  29039  curl         tcp_set_state    ffff8803a8213800: 1 -> 8
3.794016 29039  29039  curl         tcp_set_state    ffff8803a8213800: 8 -> 7
^C
# trace -t 't:tcp:tcp_set_state "%llx: %d -> %d", args->skaddr, args->oldstate, args->newstate'
TIME     PID    TID    COMM         FUNC             -
2.031911 29042  29042  curl         tcp_set_state    ffff8803a8213000: 7 -> 2
2.035019 0      0      swapper/1    tcp_set_state    ffff8803a8213000: 2 -> 1
2.746864 29042  29042  curl         tcp_set_state    ffff8803a8213000: 1 -> 8
2.754193 29042  29042  curl         tcp_set_state    ffff8803a8213000: 8 -> 7
  • 1.
  • 2.
  • 3.
  • 4.
  • 5.
  • 6.
  • 7.
  • 8.
  • 9.
  • 10.
  • 11.
  • 12.
  • 13.

Back in 4.16, this is the current list of tracepoints with parameters provided by bcc's tplist utility:

# tplist -v 'tcp:*'
tcp:tcp_retransmit_skb
    const void * skbaddr;
    const void * skaddr;
    __u16 sport;
    __u16 dport;
    __u8 saddr[4];
    __u8 daddr[4];
    __u8 saddr_v6[16];
    __u8 daddr_v6[16];
tcp:tcp_send_reset
    const void * skbaddr;
    const void * skaddr;
    __u16 sport;
    __u16 dport;
    __u8 saddr[4];
    __u8 daddr[4];
    __u8 saddr_v6[16];
    __u8 daddr_v6[16];
tcp:tcp_receive_reset
    const void * skaddr;
    __u16 sport;
    __u16 dport;
    __u8 saddr[4];
    __u8 daddr[4];
    __u8 saddr_v6[16];
    __u8 daddr_v6[16];
tcp:tcp_destroy_sock
    const void * skaddr;
    __u16 sport;
    __u16 dport;
    __u8 saddr[4];
    __u8 daddr[4];
    __u8 saddr_v6[16];
    __u8 daddr_v6[16];
tcp:tcp_retransmit_synack
    const void * skaddr;
    const void * req;
    __u16 sport;
    __u16 dport;
    __u8 saddr[4];
    __u8 daddr[4];
    __u8 saddr_v6[16];
    __u8 daddr_v6[16];
tcp:tcp_probe
    __u8 saddr[sizeof(struct sockaddr_in6)];
    __u8 daddr[sizeof(struct sockaddr_in6)];
    __u16 sport;
    __u16 dport;
    __u32 mark;
    __u16 length;
    __u32 snd_nxt;
    __u32 snd_una;
    __u32 snd_cwnd;
    __u32 ssthresh;
    __u32 snd_wnd;
    __u32 srtt;
    __u32 rcv_wnd;
# tplist -v sock:inet_sock_set_state
sock:inet_sock_set_state
    const void * skaddr;
    int oldstate;
    int newstate;
    __u16 sport;
    __u16 dport;
    __u16 family;
    __u8 protocol;
    __u8 saddr[4];
    __u8 daddr[4];
    __u8 saddr_v6[16];
    __u8 daddr_v6[16];
  • 1.
  • 2.
  • 3.
  • 4.
  • 5.
  • 6.
  • 7.
  • 8.
  • 9.
  • 10.
  • 11.
  • 12.
  • 13.
  • 14.
  • 15.
  • 16.
  • 17.
  • 18.
  • 19.
  • 20.
  • twenty one.
  • twenty two.
  • twenty three.
  • twenty four.
  • 25.
  • 26.
  • 27.
  • 28.
  • 29.
  • 30.
  • 31.
  • 32.
  • 33.
  • 34.
  • 35.
  • 36.
  • 37.
  • 38.
  • 39.
  • 40.
  • 41.
  • 42.
  • 43.
  • 44.
  • 45.
  • 46.
  • 47.
  • 48.
  • 49.
  • 50.
  • 51.
  • 52.
  • 53.
  • 54.
  • 55.
  • 56.
  • 57.
  • 58.
  • 59.
  • 60.
  • 61.
  • 62.
  • 63.
  • 64.
  • 65.
  • 66.
  • 67.
  • 68.
  • 69.
  • 70.
  • 71.

The first TCP tracepoint added is Cong Wang's (Twitter) tcp:tcp_retransmit_skb. He cites my kprobe-based tcpretrans tool from perf-tools as an example consumer. Song Liu (Facebook) added five more tracepoints, including tcp:tcp_set_state is now sock:inet_sock_set_state. Thanks to Cong and Song, and David S. Miller (network maintainer) for accepting these and providing feedback on the ongoing tcp tracepoint work.

During development, I discussed recent additions with Song (and Alexei Starovoitov), ​​so I already knew why they existed and what they were for. Some rough descriptions of the current TCP tracepoints:

  • tcp: tcp_retransmit_skb: Track retransmits. Useful for understanding network issues, including congestion. Will be used by my tcpretrans tool instead of kprobes.
  • tcp: tcp_retransmit_synack: Track SYN/ACK retransmissions. I imagine this could be used for a kind of DoS detection (SYN flood triggering SYN/ACK followed by retransmission). This is separate from tcp:tcp_retransmit_skb because this event has no skb.
  • tcp: tcp_destroy_sock: Required by any program that summarizes details in memory of a TCP session that will be keyed by a sock address. This probe can be used to know if the session has ended so that the sock address will be reused and any aggregated data so far should be used and then deleted.
  • tcp: tcp_send_reset: This tracks RST sends during valid sockets to diagnose these types of problems.
  • tcp: tcp_receive_reset: Track RST reception.
  • tcp: tcp_probe: for TCP congestion window tracking, which also allows older TCP probe modules to be deprecated and removed. This was added by Masami Hiramatsu and merged into Linux 4.16.
  • socks: inet_sock_set_state: can be used for many things. The tcplife tool is one of them, but my tcpconnect and tcpaccept bcc tools can also be converted to use this tracepoint. We could add separate tcp:tcp_connect and tcp:tcp_accept tracepoints (or tcp:tcp_active_open and tcp:tcp_passive_open), but sock:inet_sock_set_state can be used for this purpose.

I can imagine how useful these TCP tracepoints would be, since I designed and used a similar tracepoint many years ago: the DTrace TCP provider I demonstrated at CEC2006. I originally split TCP state changes into probes for each state, but when it was merged it became a single tcp:::state change probe like we do in Linux now via socks tracepoint .

What's next? The tracepoints for tcp:tcp_send and tcp:tcp_receive may be convenient, but special attention must be paid to the tiny overhead they may add (both enabled and specifically disabled), since sending/receiving can be a very frequent activity. It would also be useful to have more trace points for error conditions, such as connection refused paths, which might help analyze DoS attacks.

If you're interested in adding TCP tracepoints, I recommend writing a kprobe solution first as a proof of concept and to gain some production experience. This is the role played by my previous kprobe tool. The kprobe solution will show whether the tracepoint is useful, and if so, help to include it in the Linux kernel maintainers.