捂脸斋

江山风月 🌘 本无常主 🌈 闲者便是主人 🍃🍃🍃

Dubbo 若干连接错误分析

Dubbo 若干连接错误分析

这里以 defray 和 promotion 两个服务进行分析,defray 作为消费者,promotion 作为提供者。

报错场景

操作步骤

  1. 两个服务正常启动并运行一段时间。
  2. 向 promotion 发送普通终止信号,使其进入应用关闭流程。
  3. promotion 从 Nacos 中消失,defray 收到 promotion 的下线通知,但随后仍出现一次连接失败日志。

这里的“终止”不是 kill -9SIGKILL 不会执行 JVM shutdown hook,服务端也没有机会主动注销或发送 GOAWAY,不能用于验证本文讨论的优雅下线流程。

日志

收到 promotion 下线通知。

1
2
3
4
5
6
7
8
9
10
11
2025-12-03 09:41:08.394 [vent.InstancesChangeEvent] .r.c.e.l.ServiceInstancesChangedListener INFO [Log4j2Logger.info.71] [] [DUBBO] Received instance notification, serviceName: aladdin-promotion-center, instances: 1, dubbo version: 3.3.5, current host: 192.169.247.70
2025-12-03 09:41:08.395 [vent.InstancesChangeEvent] .r.c.e.l.ServiceInstancesChangedListener INFO [Log4j2Logger.info.71] [] [DUBBO] 0 unique working revisions: , dubbo version: 3.3.5, current host: 192.169.247.70
2025-12-03 09:41:08.396 [vent.InstancesChangeEvent] .r.c.e.l.ServiceInstancesChangedListener INFO [Log4j2Logger.info.71] [] [DUBBO] Notify service DEFAULT_GROUP/com.aladdin.aistudio.promotion.coupon.CouponTemplateRpc:v1.0.0:null with urls1, dubbo version: 3.3.5, current host: 192.169.247.70
2025-12-03 09:41:08.397 [vent.InstancesChangeEvent] .d.r.c.ServiceDiscoveryRegistryDirectory WARN [Log4j2Logger.warn.91] [] [DUBBO] Received url with EMPTY protocol from registry ServiceDiscoveryRegistryDirectory(registry: mse-9adaaae2-p.nacos-ans.mse.aliyuncs.com:8848, subscribed key: [aladdin-promotion-center])-Directory(invokers: 1[192.169.120.255:20886], validInvokers: 1[192.169.120.255:20886], invokersToReconnect: 0[]), will clear all available addresses., dubbo version: 3.3.5, current host: 192.169.247.70, error code: 4-1. This may be caused by ,go to https://dubbo.apache.org/faq/4/1 to find instructions.
2025-12-03 09:41:08.398 [vent.InstancesChangeEvent] .r.c.m.DefaultMigrationAddressComparator INFO [Log4j2Logger.info.71] [] [DUBBO] No instance address available, stop compare., dubbo version: 3.3.5, current host: 192.169.247.70
2025-12-03 09:41:08.398 [vent.InstancesChangeEvent] .d.r.c.ServiceDiscoveryRegistryDirectory INFO [Log4j2Logger.info.71] [] [DUBBO] Received invokers changed event from registry. Registry type: instance. Service Key: DEFAULT_GROUP/com.aladdin.aistudio.promotion.coupon.CouponTemplateRpc:v1.0.0. Urls Size : 1. Invokers Size : 0. Available Size: 0. Available Invokers : empty, dubbo version: 3.3.5, current host: 192.169.247.70
2025-12-03 09:41:08.398 [vent.InstancesChangeEvent] .r.c.e.l.ServiceInstancesChangedListener INFO [Log4j2Logger.info.71] [] [DUBBO] Notify service DEFAULT_GROUP/com.aladdin.aistudio.promotion.coupon.UserCouponRpc:v1.0.0:null with urls 1,dubbo version: 3.3.5, current host: 192.169.247.70
2025-12-03 09:41:08.399 [vent.InstancesChangeEvent] .d.r.c.ServiceDiscoveryRegistryDirectory WARN [Log4j2Logger.warn.91] [] [DUBBO] Received url with EMPTY protocol from registry ServiceDiscoveryRegistryDirectory(registry: mse-9adaaae2-p.nacos-ans.mse.aliyuncs.com:8848, subscribed key: [aladdin-promotion-center])-Directory(invokers: 1[192.169.120.255:20886], validInvokers: 1[192.169.120.255:20886], invokersToReconnect: 0[]), will clear all available addresses., dubbo version: 3.3.5, current host: 192.169.247.70, error code: 4-1. This may be caused by , go to https://dubbo.apache.org/faq/4/1 to find instructions.
2025-12-03 09:41:08.400 [vent.InstancesChangeEvent] o.a.d.r.t.netty4.NettyConnectionClient INFO [Log4j2Logger.info.76] [] [DUBBO] Destroying connection to /192.169.120.255:20886, because the reference count reaches 0, dubbo version: 3.3.5, current host: 192.169.247.70
2025-12-03 09:41:08.401 [vent.InstancesChangeEvent] .r.c.m.DefaultMigrationAddressComparator INFO [Log4j2Logger.info.71] [] [DUBBO] No instance address available, stop compare., dubbo version: 3.3.5, current host: 192.169.247.70
2025-12-03 09:41:08.402 [vent.InstancesChangeEvent] .d.r.c.ServiceDiscoveryRegistryDirectory INFO [Log4j2Logger.info.71] [] [DUBBO] Received invokers changed event from registry. Registry type: instance. Service Key: DEFAULT_GROUP/com.aladdin.aistudio.promotion.coupon.UserCouponRpc:v1.0.0. Urls Size : 1. Invokers Size: 0. Available Size: 0. Available Invokers : empty, dubbo version: 3.3.5, current host: 192.169.247.70

2s 后,另一个线程连接报错。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
2025-12-03 09:41:10.893 [tivity-scheduler-thread-1] o.a.d.r.t.netty4.NettyConnectionClient ERROR [Log4j2Logger.error.126] [] [DUBBO] Failed to connect to provider server by other reason, dubbo version: 3.3.5, current host: 192.169.247.70, error code: 6-1. This may be caused by network disconnected, go to https://dubbo.apache.org/faq/6/1 to find instructions.
io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /192.169.120.255:20886
Caused by: java.net.ConnectException: Connection refused
at java.base/sun.nio.ch.Net.pollConnect(Native Method)
at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672)
at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:946)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:336)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:339)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:784)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:732)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:658)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:998)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:833)
2025-12-03 09:41:10.894 [tivity-scheduler-thread-1] o.a.d.r.t.netty4.NettyConnectionClient ERROR [Log4j2Logger.error.111] [] [DUBBO] Failed to reconnect to server: /192.169.120.255:20886, dubbo version: 3.3.5, current host: 192.169.247.70, error code: 6-16. This may be caused by , go to https://dubbo.apache.org/faq/6/16 to find instructions.

原因分析

1
为什么已经收到了下线通知,还报错了?并且整个过程没有对 promotion 服务进行 RPC 调用。

收到下线通知的线程是 nacos.publisher-com.alibaba.nacos.client.naming.event.InstancesChangeEvent,报错线程是 Dubbo-framework-connectivity-scheduler-thread-1。日志中的线程名因为日志格式长度限制而被截断。

后一个线程执行的是 Dubbo 提交的异步重连任务,不是业务 RPC 调用,也不能仅凭线程名将它定义为周期性网络探测。Nacos 通知和连接重连由不同线程处理,因此“已经收到下线通知”和“随后仍发生一次重连”在并发时序上并不矛盾。

三个线程池

A Nacos 实例变更线程池

nacos.publisher-com.alibaba.nacos.client.naming.event.InstancesChangeEvent 这个线程池,负责处理实例变更事件。当有服务下线时,其调用栈如下。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
"nacos.publisher-com.alibaba.nacos.client.naming.event.InstancesChangeEvent@17340" daemon prio=5 tid=0xc2 nid=NA runnable java.lang.Thread.State: RUNNABLE
at org.apache.dubbo.remoting.transport.AbstractClient.disconnect(AbstractClient.java:307)
at org.apache.dubbo.remoting.transport.AbstractClient.close(AbstractClient.java:386)
at org.apache.dubbo.remoting.transport.netty4.AbstractNettyConnectionClient.destroy(AbstractNettyConnectionClient.java:340)
at org.apache.dubbo.remoting.api.connection.AbstractConnectionClient.release(AbstractConnectionClient.java:80)
at org.apache.dubbo.rpc.protocol.tri.TripleInvoker.destroy(TripleInvoker.java:367)
at org.apache.dubbo.rpc.listener.ListenerInvokerWrapper.destroy(ListenerInvokerWrapper.java:82)
at org.apache.dubbo.rpc.cluster.filter.FilterChainBuilder$CopyOfFilterChainNode.destroy(FilterChainBuilder.java:383)
at org.apache.dubbo.rpc.cluster.filter.FilterChainBuilder$CallbackRegistrationInvoker.destroy(FilterChainBuilder.java:275)
at org.apache.dubbo.rpc.protocol.ReferenceCountInvokerWrapper.destroy(ReferenceCountInvokerWrapper.java:86)
at org.apache.dubbo.registry.client.ServiceDiscoveryRegistryDirectory.destroyAllInvokers(ServiceDiscoveryRegistryDirectory.java:614)
at org.apache.dubbo.registry.client.ServiceDiscoveryRegistryDirectory.refreshInvoker(ServiceDiscoveryRegistryDirectory.java:333)
at org.apache.dubbo.registry.client.ServiceDiscoveryRegistryDirectory.refreshOverrideAndInvoker(ServiceDiscoveryRegistryDirectory.java:222)
- locked <0x5065> (a org.apache.dubbo.registry.client.ServiceDiscoveryRegistryDirectory)
at org.apache.dubbo.registry.client.ServiceDiscoveryRegistryDirectory.notify(ServiceDiscoveryRegistryDirectory.java:214)
at org.apache.dubbo.registry.client.event.listener.ServiceInstancesChangedListener.lambda$notifyAddressChanged$10(ServiceInstancesChangedListener.java:473)
at org.apache.dubbo.registry.client.event.listener.ServiceInstancesChangedListener$$Lambda$2665/0x0000007001fe7c10.accept(Unknown Source:-1)
at java.util.concurrent.ConcurrentHashMap.forEach(ConcurrentHashMap.java:1603)
at org.apache.dubbo.registry.client.event.listener.ServiceInstancesChangedListener.lambda$notifyAddressChanged$11(ServiceInstancesChangedListener.java:464)
at org.apache.dubbo.registry.client.event.listener.ServiceInstancesChangedListener$$Lambda$2664/0x0000007001fe79e0.get(Unknown Source:-1)
at org.apache.dubbo.metrics.event.MetricsEventBus.post(MetricsEventBus.java:76)
at org.apache.dubbo.metrics.event.MetricsEventBus.post(MetricsEventBus.java:59)
at org.apache.dubbo.registry.client.event.listener.ServiceInstancesChangedListener.notifyAddressChanged(ServiceInstancesChangedListener.java:461)
at org.apache.dubbo.registry.client.event.listener.ServiceInstancesChangedListener.doOnEvent(ServiceInstancesChangedListener.java:225)
- locked <0x5066> (a org.apache.dubbo.registry.client.event.listener.ServiceInstancesChangedListener)
at org.apache.dubbo.registry.client.event.listener.ServiceInstancesChangedListener.onEvent(ServiceInstancesChangedListener.java:125)
at org.apache.dubbo.registry.nacos.NacosServiceDiscovery.handleEvent(NacosServiceDiscovery.java:245)
at org.apache.dubbo.registry.nacos.NacosServiceDiscovery.access$000(NacosServiceDiscovery.java:65)
at org.apache.dubbo.registry.nacos.NacosServiceDiscovery$NacosEventListener.onEvent(NacosServiceDiscovery.java:217)
at com.alibaba.nacos.client.naming.event.InstancesChangeNotifier.onEvent(InstancesChangeNotifier.java:126)
at com.alibaba.nacos.client.naming.event.InstancesChangeNotifier.onEvent(InstancesChangeNotifier.java:42)
at com.alibaba.nacos.common.notify.DefaultPublisher.lambda$notifySubscriber$0(DefaultPublisher.java:199)
at com.alibaba.nacos.common.notify.DefaultPublisher$$Lambda$2524/0x0000007001f83d68.run(Unknown Source:-1)
at com.alibaba.nacos.common.notify.DefaultPublisher.notifySubscriber(DefaultPublisher.java:206)
at com.alibaba.nacos.common.notify.DefaultPublisher.receiveEvent(DefaultPublisher.java:190)
at com.alibaba.nacos.common.notify.DefaultPublisher.openEventHandler(DefaultPublisher.java:112)
at com.alibaba.nacos.common.notify.DefaultPublisher.run(DefaultPublisher.java:95)

B Netty worker 线程池

Triple 使用 HTTP/2。客户端收到 GOAWAY 后,TripleGoAwayHandler 会把事件交给 NettyConnectionHandler

GOAWAY 表示当前 HTTP/2 连接将不再接受新的流,并不禁止客户端建立新连接。Dubbo 3.3.5 的实现会在收到 GOAWAY 后清理当前连接,并在连接对象尚未关闭时调度一次重连。

C Dubbo 异步重连线程池

Dubbo-framework-connectivity-scheduler-thread 执行 AbstractNettyConnectionClient#doReconnect 提交的异步连接任务。本次错误日志来自该线程池中的 doConnect(),不能据此认定发生了业务调用。

调用链路

先看第一个线程池,也就是 Nacos 实例变更线程池。地址列表清空后,旧 Invoker 被销毁,连接引用计数降为 0,最终执行:

1
2
3
4
5
6
ServiceDiscoveryRegistryDirectory.destroyAllInvokers()
-> TripleInvoker.destroy()
-> AbstractConnectionClient.release()
-> AbstractNettyConnectionClient.destroy()
-> AbstractClient.close()
-> AbstractClient.disconnect()

调试时发现 AbstractClient.disconnect()getChannel() 返回 null,因此这里没有再次关闭 Channel。

image-20260612181244328

再看 Netty worker 线程中的 GOAWAY 链路。

image-20260612181309982

image-20260612181330076

  1. TripleGoAwayHandler 收到 Http2GoAwayFrame
  2. NettyConnectionHandler.onGoAway() 调用 connectionClient.onGoaway(nettyChannel)
  3. onGoaway()channelRef 从当前 Channel 置为 null,并关闭当前 Channel。
  4. NettyConnectionHandler.reconnect() 检查连接对象是否已关闭:
    • 如果已经关闭,记录“will not reconnect”并结束。
    • 如果尚未关闭,在 Netty EventLoop 上延迟 1 秒调用 doReconnect()
  5. doReconnect() 再把 doConnect() 提交到 Dubbo connectivity executor。本次连接时服务端端口已经关闭,因此得到 Connection refused

对应源码为 org.apache.dubbo.remoting.transport.netty4.NettyConnectionHandler

完整调用栈如下。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
"NettyClientWorker-11-4@27041" daemon prio=5 tid=0x5fc nid=NA runnable java.lang.Thread.State: RUNNABLE
at org.apache.dubbo.remoting.transport.netty4.NettyConnectionHandler.reconnect(NettyConnectionHandler.java:71)
at org.apache.dubbo.remoting.transport.netty4.NettyConnectionHandler.onGoAway(NettyConnectionHandler.java:66)
at org.apache.dubbo.rpc.protocol.tri.transport.TripleGoAwayHandler.channelRead(TripleGoAwayHandler.java:44)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:93)
at org.apache.dubbo.rpc.protocol.tri.TriplePingPongHandler.channelRead(TriplePingPongHandler.java:41)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at io.netty.handler.codec.http2.Http2MultiplexHandler.channelRead(Http2MultiplexHandler.java:205)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at io.netty.handler.codec.http2.Http2FrameCodec.onHttp2Frame(Http2FrameCodec.java:732)
at io.netty.handler.codec.http2.Http2FrameCodec$FrameListener.onGoAwayRead(Http2FrameCodec.java:677)
at io.netty.handler.codec.http2.Http2FrameListenerDecorator.onGoAwayRead(Http2FrameListenerDecorator.java:89)
at io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder.onGoAwayRead0(DefaultHttp2ConnectionDecoder.java:229)
at io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder$FrameReadListener.onGoAwayRead(DefaultHttp2ConnectionDecoder.java:609)
at io.netty.handler.codec.http2.Http2InboundFrameLogger$1.onGoAwayRead(Http2InboundFrameLogger.java:117)
at io.netty.handler.codec.http2.DefaultHttp2FrameReader.readGoAwayFrame(DefaultHttp2FrameReader.java:579)
at io.netty.handler.codec.http2.DefaultHttp2FrameReader.processPayloadState(DefaultHttp2FrameReader.java:265)
at io.netty.handler.codec.http2.DefaultHttp2FrameReader.readFrame(DefaultHttp2FrameReader.java:164)
at io.netty.handler.codec.http2.Http2InboundFrameLogger.readFrame(Http2InboundFrameLogger.java:39)
at io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder.decodeFrame(DefaultHttp2ConnectionDecoder.java:186)
at io.netty.handler.codec.http2.DecoratingHttp2ConnectionDecoder.decodeFrame(DecoratingHttp2ConnectionDecoder.java:61)
at io.netty.handler.codec.http2.Http2ConnectionHandler$FrameDecoder.decode(Http2ConnectionHandler.java:391)
at io.netty.handler.codec.http2.Http2ConnectionHandler.decode(Http2ConnectionHandler.java:451)
at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:530)
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:469)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:289)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1357)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:868)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:796)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:732)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:658)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:998)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:840)

getChannel() 为什么返回 null

Dubbo 3.3.5 中,AbstractNettyConnectionClient#getChannel() 的逻辑是:

1
2
3
4
5
6
7
protected Channel getChannel() {
io.netty.channel.Channel c = getNettyChannel();
if (c == null) {
return null;
}
return NettyChannel.getOrAddChannel(c, getUrl(), this);
}

其中 getNettyChannel() 直接读取 channelRef。收到 GOAWAY 时,onGoaway() 会执行:

1
2
3
4
5
6
if (channelRef.compareAndSet(nettyChannel, null)) {
if (nettyChannel.isOpen()) {
nettyChannel.close();
}
NettyChannel.removeChannelIfDisconnected(nettyChannel);
}

因此,这里的 getChannel() == null 首先表示 channelRef 已经被清空。代码还没有执行到 NettyChannel.getOrAddChannel(),不能据此归因于 NettyChannel.CHANNEL_MAP,也不能说明 Channel 没有被正确关闭。

当前能够确认的原因

当前日志和源码能够确认:

  1. 服务端下线过程中,客户端收到了 GOAWAY。
  2. Dubbo 3.3.5 收到 GOAWAY 后,在连接对象尚未关闭时会调度重连。
  3. Nacos 地址移除会并发触发 Invoker 销毁和连接对象关闭。
  4. 本次重连真正执行时,服务端端口已经关闭,所以出现一次 Connection refused

因此,更准确的描述是:

GOAWAY 触发的重连与 Nacos 地址移除、Invoker 销毁并发发生。在本次时序中,重连任务通过了连接关闭检查,随后在服务端端口已经关闭时尝试建立新连接,从而产生错误日志。

这里仍有一个竞态点需要进一步确认:reconnect()doConnect()close() 理论上都包含关闭状态检查。仅凭现有日志,尚不能确定重连是在哪个检查点之前进入,也不能确认处理 Nacos 通知和 GOAWAY 的是否始终是同一个 NettyConnectionClient 实例。

建议临时增加以下调试信息,以便得到最终根因:

  • System.identityHashCode(connectionClient)
  • connectionClient.isClosed()
  • connectionClient.getCounter()
  • channelRef 当前值
  • 进入 onGoAway()reconnect()doReconnect()doConnect()destroy() 的完整时间戳

规避方式

服务停止前先执行 Dubbo QoS offline,使注册中心先移除该实例,并为消费者更新地址列表预留时间,然后再停止进程。

该方式改变了“注册中心摘除、连接关闭、进程退出”的时序,能够在当前验证中避免错误日志。但它是推荐的下线流程和竞态规避方案,现有实验不能反向证明 CHANNEL_MAP 是问题根因。

新的调用链路

从 Nacos 实例变更线程开始,可以看到连接对象被销毁时 Channel 仍然可获取,因此执行了 channel.close()

image-20260612180526595

再看 netty 的 goAway 链路。

image-20260612180546749

GOAWAY 处理进入 reconnect() 时,连接对象已经处于关闭状态,因此命中 isClosed() 判断,不再提交重连任务。

验证

对照案例:直接停止进程

启动提供方 promotion 和消费方 defray。直接停止提供方进程,promotion 日志如下。

image-20260612180736869

1
2
1. 成功下线。
2. netty 连接断开。

defray 日志如下。

image-20260612180815317

1
2
3
1. 收到了服务下线通知。
2. 指向 promotion:20886 的 Netty 连接断开。
3. 发生 reconnect,报错。

实验案例:先执行 QoS offline

启动提供方 promotion 和消费方 defray。在提供方执行 QoS offline,等待消费者收到地址变更后再停止进程。promotion 日志如下。

image-20260612180844057

1
2
1. 成功下线。
2. netty 连接断开。

defray 日志如下。

image-20260612180905314

1
2
3
1. 收到了服务下线通知。
2. 指向 promotion:20886 的 Netty 连接断开。
3. 不重连。

提供方 promotion 下线后,消费方无错误日志。

结论

直接停止进程的案例里,没有出现下面两行日志:

1
2
2025-12-05 14:20:38.664 [vent.InstancesChangeEvent] o.a.d.r.transport.netty4.NettyChannel INFO [Log4j2Logger.info.71] [] [DUBBO] Close netty channel [id:0xfd9c29db, L:/10.230.206.2:63671 - R:/10.230.206.2:20886], dubbo version: 3.3.5,current host: 10.230.206.2
2025-12-05 14:20:38.664 [NettyClientWorker-11-3 ] o.a.d.r.t.netty4.NettyConnectionHandler INFO [Log4j2Logger.info.76] [] [DUBBO] The connection org.apache.dubbo.remoting.transport.netty4.NettyConnectionClient [10.230.206.2/<unresolved>:0 -> /10.230.206.2:20886] (Ref=0,local=null,remote=/10.230.206.2:20886 has been closed and will not reconnect, dubbo version: 3.3.5, current host: 10.230.206.2

这个现象与下面的解释一致:直接停止进程时,GOAWAY 处理没有观察到连接对象已经关闭,于是继续调度重连;先执行 QoS offline 时,连接对象先被关闭,GOAWAY 处理命中 isClosed() 后停止重连。要把该解释提升为确定结论,仍需通过连接对象标识和状态日志确认两个线程操作的是同一实例。

最终结论如下:

  1. 没有业务 RPC 调用时,Dubbo Triple 仍可能因 GOAWAY 触发连接重建。
  2. getChannel() == nullchannelRef 已被清空的结果,不是 CHANNEL_MAP 查找失败的证据。
  3. 这次错误是下线阶段的并发时序导致的一次失败重连,不代表消费者仍把已下线实例用于业务调用。
  4. QoS offline 加等待窗口可以规避当前现象,但若要定位到 Dubbo 内部的具体竞态点,还需要补充同一连接对象的状态日志。

一些说明

下面的配置不能单独实现“先从 Nacos 摘除 Dubbo 服务,固定等待 30 秒,再停止 Dubbo 服务”的目标:

1
2
3
4
5
6
7
server:
port: 8080
shutdown: graceful

spring:
lifecycle:
timeout-per-shutdown-phase: 30s

server.shutdown=graceful 控制 Spring Boot Web Server 的优雅停机。对于当前使用的 Tomcat,它会停止接收新请求,并等待进行中的 HTTP 请求完成;如果没有活跃请求,可能立即完成,不会固定等待 30 秒。

spring.lifecycle.timeout-per-shutdown-phase 是该生命周期阶段允许等待的最长时间,也不是固定延迟时间。

还需要区分两套注册行为:

  • Dubbo 自身通过 Dubbo registry/service discovery 注册的服务。
  • Spring Cloud Alibaba 通过 NacosServiceRegistry 注册的 Spring Cloud 应用实例。

下面日志中的 com.alibaba.cloud.nacos.registry.NacosServiceRegistry 属于 Spring Cloud Alibaba,不能直接称为“Dubbo 向 Nacos 反注册”。日志显示 Web Server 的 graceful shutdown 先完成,随后 Spring Cloud Alibaba 才从 Nacos 注销,因此该配置不满足“先摘流量,再等待连接排空”的目标。

1
2
3
4
5
6
7
8
2025-12-03 19:06:42.413 [ngApplicationShutdownHook] o.a.d.c.s.c.DubboSpringInitializer INFO [Log4j2Logger.info.71] [] [DUBBO] Unbind Dubbo Module[1.1.1] from spring container: org.springframework.beans.factory.support.DefaultListableBeanFactory@5ccb85d6, dubbo version: 3.3.0, current host: 10.230.206.2
2025-12-03 19:06:42.415 [ngApplicationShutdownHook] o.s.b.w.e.tomcat.GracefulShutdown INFO [GracefulShutdown.shutDownGracefully.54] [] Commencing graceful shutdown. Waiting for active requests to complete
2025-12-03 19:06:42.417 [tomcat-shutdown ] o.s.b.w.e.tomcat.GracefulShutdown INFO [GracefulShutdown.doShutdown.76] [] Graceful shutdown complete
2025-12-03 19:06:42.427 [ngApplicationShutdownHook] c.a.c.n.registry.NacosServiceRegistry INFO [NacosServiceRegistry.deregister.95] [] De-registering from Nacos Server now...
2025-12-03 19:06:42.473 [ngApplicationShutdownHook] c.a.c.n.registry.NacosServiceRegistry INFO [NacosServiceRegistry.deregister.115] [] De-registration finished.
2025-12-03 19:06:42.482 [ngApplicationShutdownHook] c.a.o.s.s.SchedulerElectionConvention INFO [SchedulerElectionConvention.destroy.134] [] [OGN-SchedulerEc] Released lock on destroy.
2025-12-03 19:06:42.484 [ngApplicationShutdownHook] com.zaxxer.hikari.HikariDataSource INFO [HikariDataSource.close.349] [] HikariPool-1 - Shutdown initiated...
2025-12-03 19:06:42.535 [ngApplicationShutdownHook] com.zaxxer.hikari.HikariDataSource INFO [HikariDataSource.close.351] [] HikariPool-1 - Shutdown completed.

另外,这组日志使用 Dubbo 3.3.0,而前文连接问题使用 Dubbo 3.3.5。它可以说明当前应用中的 Spring 生命周期顺序,但不能单独证明 Dubbo 3.3.5 的内部行为。

更稳妥的发布流程是:

  1. 调用 Dubbo QoS offline 或平台提供的摘流量能力。
  2. 等待消费者收到注册中心地址变更,并等待在途请求结束。
  3. 再向应用发送普通终止信号。
  4. 超过平台设定的最大等待时间后,再执行强制终止。

参考源码