gRPC server blocked on SendMsg

We're having an issue where our gRPC streaming server is blocked on SendMsg with the following stack trace:

google.golang.org/grpc/internal/transport.(*writeQuota).get(0xc000de4040, 0x32)
    /root/go/pkg/mod/google.golang.org/[email protected]/internal/transport/flowcontrol.go:59 +0x74
google.golang.org/grpc/internal/transport.(*http2Server).Write(0xc000bb4680, 0xc000aa6000, {0xc000f2be60, 0x5, 0x5}, {0xc000d6d590, 0x2d, 0x2d}, 0x0)
    /root/go/pkg/mod/google.golang.org/[email protected]/internal/transport/http2_server.go:1090 +0x23b
google.golang.org/grpc.(*serverStream).SendMsg(0xc0002785b0, {0xb8f9e0, 0xc000b686c0})
    /root/go/pkg/mod/google.golang.org/[email protected]/stream.go:1530 +0x1cc

Our server streams unidirectionally to clients. Previously this happened every 4-6 hours on a node, but after about 15 minutes the TCP connection would close, the client would reconnect, and streaming would resume. We fixed that by initializing the server with a keepalive every 10s:

server := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{Time: time.Duration(10) * time.Second, Timeout: 0}))

and the issue did not occur for the next two days. Now it has been happening on a single node for the past 5 hours, and it hasn't gone away.

Here's the output of ss:

$ ss -ntmp|grep -A 1 9222
ESTAB      0      0      10.192.254.1:9222               10.120.224.70:50380
     skmem:(r0,rb524288,t0,tb524288,f0,w0,o0,bl0,d0)

On a node where the server is functioning properly, the t (wmem_alloc) and w (wmem_queued) values are non-zero. Here both are zero, which, according to this answer, indicates that no packets are queued up for transmit.
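
To compare the skmem values across nodes without reading them by eye, the field can be parsed programmatically. This is a minimal sketch, assuming the skmem:(...) layout shown in the ss output above; parseSkmem and sendQueueEmpty are hypothetical helper names, not part of any library.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseSkmem parses the "skmem:(...)" field printed by `ss -m` into a map
// keyed by the short field name (r, rb, t, tb, f, w, o, bl, d).
func parseSkmem(line string) (map[string]uint64, error) {
	start := strings.Index(line, "skmem:(")
	if start < 0 {
		return nil, fmt.Errorf("no skmem field in %q", line)
	}
	rest := line[start+len("skmem:("):]
	end := strings.Index(rest, ")")
	if end < 0 {
		return nil, fmt.Errorf("unterminated skmem field in %q", line)
	}
	fields := make(map[string]uint64)
	for _, tok := range strings.Split(rest[:end], ",") {
		// Each token is a letter prefix followed by a decimal value, e.g. "rb524288".
		i := strings.IndexFunc(tok, func(r rune) bool { return r >= '0' && r <= '9' })
		if i <= 0 {
			return nil, fmt.Errorf("malformed token %q", tok)
		}
		v, err := strconv.ParseUint(tok[i:], 10, 64)
		if err != nil {
			return nil, fmt.Errorf("bad value in %q: %w", tok, err)
		}
		fields[tok[:i]] = v
	}
	return fields, nil
}

// sendQueueEmpty reports whether both t (wmem_alloc) and w (wmem_queued) are zero.
func sendQueueEmpty(f map[string]uint64) bool {
	return f["t"] == 0 && f["w"] == 0
}

func main() {
	m, err := parseSkmem("     skmem:(r0,rb524288,t0,tb524288,f0,w0,o0,bl0,d0)")
	if err != nil {
		panic(err)
	}
	fmt.Println("send queue empty:", sendQueueEmpty(m))
}
```

Run against the line from the blocked node, this reports an empty send queue; a line from a healthy node with non-zero t and w would report the opposite.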

I also see keep-alive ACKs sent from the server every 10s. The sequence is:

  • server sends PSH, ACK
  • client immediately responds with PSH, ACK
  • server sends ACK to above
  • server sends another PSH, ACK after 10s

So the server keep-alive mechanism thinks everything is OK. I don't see any keep-alives from the client. I'll try setting a keep-alive for the client, but why is this problem happening?



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
