MirrorMaker 2 getting disconnect errors from source Kafka

Hi
I am trying to sync data from a source Kafka cluster to a target Kafka cluster. Kafka Connect uses the target cluster as its bootstrap server and runs the MirrorMaker connectors with the configuration below. The source Kafka is accessed via an external IP, while the target is in the same k8s namespace. I am getting these errors in my MirrorMaker log. Kafka version is 3.9.1.

2026-06-11 09:57:41 INFO NetworkClient:871 - [Consumer clientId=iad-5->phx-2|mirror-source-connector-0|replication-consumer, groupId=null] Disconnecting from node 2 due to request timeout.
2026-06-11 09:57:41 INFO NetworkClient:364 - [Consumer clientId=iad-5->phx-2|mirror-source-connector-0|replication-consumer, groupId=null] Cancelled in-flight FETCH request with correlation id 38 due to node 2 being disconnected (elapsed time since creation: 30115ms, elapsed time since send: 30005ms, throttle time: 0ms, request timeout: 30000ms)
2026-06-11 09:57:41 INFO FetchSessionHandler:618 - [Consumer clientId=iad-5->phx-2|mirror-source-connector-0|replication-consumer, groupId=null] Error sending fetch request (sessionId=INVALID, epoch=INITIAL) to node 2:
org.apache.kafka.common.errors.DisconnectException
2026-06-11 09:57:41 INFO NetworkClient:871 - [Consumer clientId=iad-5->phx-2|mirror-source-connector-0|replication-consumer, groupId=null] Disconnecting from node 0 due to request timeout.
2026-06-11 09:57:41 INFO NetworkClient:364 - [Consumer clientId=iad-5->phx-2|mirror-source-connector-0|replication-consumer, groupId=null] Cancelled in-flight FETCH request with correlation id 37 due to node 0 being disconnected (elapsed time since creation: 30124ms, elapsed time since send: 30006ms, throttle time: 0ms, request timeout: 30000ms)
2026-06-11 09:57:41 INFO FetchSessionHandler:618 - [Consumer clientId=iad-5->phx-2|mirror-source-connector-0|replication-consumer, groupId=null] Error sending fetch request (sessionId=INVALID, epoch=INITIAL) to node 0:
org.apache.kafka.common.errors.DisconnectException

And these in the other pod for the same MirrorMaker:

2026-06-11 14:32:36 INFO NetworkClient:1021 - [AdminClient clientId=iad-5->phx-2|mirror-checkpoint-connector|checkpoint-source-admin] Node 1 disconnected.
2026-06-11 14:32:36 INFO NetworkClient:364 - [AdminClient clientId=iad-5->phx-2|mirror-checkpoint-connector|checkpoint-source-admin] Cancelled in-flight METADATA request with correlation id 4341 due to node 1 being disconnected (elapsed time since creation: 52ms, elapsed time since send: 52ms, throttle time: 0ms, request timeout: 30000ms)
2026-06-11 14:33:02 INFO Scheduler:99 - refreshing idle consumers group offsets at target cluster took 142 ms
2026-06-11 14:33:02 INFO Scheduler:99 - sync idle consumer group offset from source to target took 1 ms

This is my MirrorMaker configuration:
apiVersion: v1
kind: ConfigMap
metadata:
  name: mirrormaker2-config
data:
  connect-distributed.properties: |
    bootstrap.servers={{TARGET_KAFKA_HOST}}
    group.id=connect-cluster
    key.converter=org.apache.kafka.connect.json.JsonConverter
    value.converter=org.apache.kafka.connect.json.JsonConverter
    key.converter.schemas.enable=true
    value.converter.schemas.enable=true
    internal.key.converter=org.apache.kafka.connect.json.JsonConverter
    internal.value.converter=org.apache.kafka.connect.json.JsonConverter
    internal.key.converter.schemas.enable=false
    internal.value.converter.schemas.enable=false
    offset.storage.topic=connect-offsets
    offset.storage.replication.factor=3
    config.storage.topic=connect-configs
    config.storage.replication.factor=3
    status.storage.topic=connect-status
    status.storage.replication.factor=3
    offset.flush.interval.ms=60000

  mirror-checkpoint-connector.json: |
    {
      "name": "mirror-checkpoint-connector",
      "config": {
        "connector.class": "org.apache.kafka.connect.mirror.MirrorCheckpointConnector",
        "source.cluster.alias": "{{SOURCE_CLUSTER_ALIAS}}",
        "target.cluster.alias": "{{TARGET_CLUSTER_ALIAS}}",
        "source.cluster.bootstrap.servers": "{{SOURCE_KAFKA_HOST}}",
        "target.cluster.bootstrap.servers": "{{TARGET_KAFKA_HOST}}",
        "tasks.max": "1",
        "emit.checkpoints.enabled": "true",
        "sync.group.offsets.enabled": "true",
        "emit.checkpoints.interval.seconds": "60",
        "sync.group.offsets.interval.seconds": "60",
        "refresh.groups.enabled": "true",
        "refresh.groups.interval.seconds": "600",
        "replication.policy.class": "org.apache.kafka.connect.mirror.IdentityReplicationPolicy",
        "key.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
        "value.converter": "org.apache.kafka.connect.converters.ByteArrayConverter"
      }
    }
  mirror-heartbeat-connector.json: |
    {
      "name": "mirror-heartbeat-connector",
      "config": {
        "connector.class": "org.apache.kafka.connect.mirror.MirrorHeartbeatConnector",
        "source.cluster.alias": "{{SOURCE_CLUSTER_ALIAS}}",
        "target.cluster.alias": "{{TARGET_CLUSTER_ALIAS}}",
        "source.cluster.bootstrap.servers": "{{SOURCE_KAFKA_HOST}}",
        "target.cluster.bootstrap.servers": "{{TARGET_KAFKA_HOST}}",
        "tasks.max": "1",
        "replication.policy.class": "org.apache.kafka.connect.mirror.IdentityReplicationPolicy",
        "key.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
        "value.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
        "emit.heartbeats.enabled": "true",
        "emit.heartbeats.interval.seconds": "1"
      }
    }
  mirror-source-connector.json: |
    {
      "name": "mirror-source-connector",
      "config": {
        "connector.class": "org.apache.kafka.connect.mirror.MirrorSourceConnector",
        "source.cluster.alias": "{{SOURCE_CLUSTER_ALIAS}}",
        "target.cluster.alias": "{{TARGET_CLUSTER_ALIAS}}",
        "source.cluster.bootstrap.servers": "{{SOURCE_KAFKA_HOST}}",
        "target.cluster.bootstrap.servers": "{{TARGET_KAFKA_HOST}}",
        "topics": ".*",
        "topics.exclude": "heartbeats|connect-.*|.*[-.]internal|.*\\.replica|__.*",
        "tasks.max": "1",
        "auto.offset.reset": "earliest",
        "replication.policy.class": "org.apache.kafka.connect.mirror.IdentityReplicationPolicy",
        "key.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
        "value.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
        "sync.topic.configs.enabled": "true",
        "sync.topic.acls.enabled": "true",
        "refresh.topics.enabled": "true",
        "refresh.topics.interval.seconds": "600",
        "sync.topic.configs.interval.seconds": "600",
        "sync.topic.acls.interval.seconds": "600",
        "replication.factor": "3"
      }
    }
  log4j.properties: |
    # Root logger configuration: log to stdout
    log4j.rootLogger=INFO, stdout

    # Standard Output (stdout) appender: to capture logs in Kubernetes pod logs
    log4j.appender.stdout=org.apache.log4j.ConsoleAppender
    log4j.appender.stdout.Target=System.out
    log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
    log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n

What could be causing this, and how can I fix it? This setup was working earlier; I only started observing these errors recently. Even restarting the MirrorMaker pod didn’t help.

Are you sure that the source Kafka cluster is up and healthy? You’d see these kinds of node disconnected errors for a bunch of reasons, but that’s the first one I’d recommend checking given that your setup had been working before.

If the source cluster appears fine, can you run kafka-console-consumer against the source from the target cluster (i.e., mimic what MM2 is doing)? For a connectivity issue like this it’ll likely be less noisy to use a simpler client like kafka-console-consumer or a generic network communication command line tool like nc or telnet.

Yes, the source Kafka is running. I also verified from the MirrorMaker pod running in the target cluster that the source is reachable, by running kafka-topics.sh with --bootstrap-server pointing to the primary Kafka server exposed on the LB. It returned the list of topics.

I also checked the status of the MirrorMaker connectors using the connector status APIs; they report the status as running.

I’d recommend a consumer client just to get closer to what MM2 is doing, or nc / telnet from the target cluster direct to all brokers (not the LB), since that’s what MM2 will need after bootstrapping. The kafka-topics command runs the Kafka admin client, so listing topics doesn’t really exercise the fetch path to each partition leader that MM2 depends on.
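As a rough equivalent of the nc / telnet check, here is a minimal Python sketch that attempts a plain TCP connect to every broker address, run from a pod in the target cluster. The host names in the usage comment are placeholders, not values from this thread; substitute the source brokers' advertised listener addresses. A successful connect only shows that the port is routable. It does not prove that fetch requests will complete.

```python
import socket


def check_brokers(addresses, timeout=5.0):
    """Attempt a TCP connect to each (host, port) pair and record the outcome."""
    results = {}
    for host, port in addresses:
        try:
            # create_connection resolves the name and opens the socket;
            # DNS failures, refusals, and timeouts all surface as OSError.
            with socket.create_connection((host, port), timeout=timeout):
                results[(host, port)] = "reachable"
        except OSError as exc:
            results[(host, port)] = f"unreachable: {exc}"
    return results


# Example usage (hypothetical addresses, one entry per source broker):
#   brokers = [("source-broker-0.example", 9092),
#              ("source-broker-1.example", 9092),
#              ("source-broker-2.example", 9092)]
#   for addr, status in check_brokers(brokers).items():
#       print(addr, status)
```

If any broker shows up as unreachable here while the LB address works, that points at routing to the individual brokers rather than at MM2 itself.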

Another question: does MM2 run successfully at all, or do you see connectivity errors immediately on startup?

I ran kafka-console-consumer from the target Kafka pod against the source Kafka at lbip:port and was able to read messages from a topic.
MirrorMaker started with logs saying the connectors started successfully, and I can also see that it created tasks for the different topic partitions to replicate. After that it started hitting the connection timeout errors.

Here are a few things I’d check / test to debug this further:

  • Any errors on the Kafka side when this happens?
  • Do other clients also hit errors or just MM2?
  • Does a slimmed down MM2 deployment also run into this? Try to see if scale is the cause by only replicating a trickle of data on a test topic
  • Are you running out of connections? Check the number of allowed open file handles on the broker side

  • There are no errors in the Kafka logs about this.
  • No other client service is experiencing this issue.
  • I am not able to redeploy MirrorMaker with updated configs to slim it down, due to some limitations in my deployment process, so I can’t validate this currently.
  • I don’t think we are hitting any file handle limits; nothing like that shows up in the logs either.

One thing to note: MirrorMaker is the only client that reaches Kafka through the LB, while the other client services are local to the k8s cluster where Kafka is deployed.

Any more suggestions?

I’m pretty suspicious of this. Remember that the LB might work for initial bootstrapping to get broker metadata, but after that Kafka clients talk directly to the partition leader brokers. IOW, the target cluster needs to be able to route to the source cluster brokers’ advertised listeners. Given that suspicion, the next test I’d run is to not use the source Kafka LB in the MM2 config; use the direct set of advertised listener addresses instead. You mentioned that MM2 replication had been working earlier, so I had been leaning toward source cluster errors or a scale-induced issue as opposed to connectivity, but this is worth a check.
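To see which addresses a client is handed after bootstrapping, one option is to run kafka-broker-api-versions.sh against the LB address and look at the broker header lines in its output. The sketch below parses such output in Python. It assumes header lines of the form `host:port (id: N rack: ...) -> (`, which is how I recall the tool printing each broker; treat that format, the function names, and the sample addresses in the usage note as assumptions and adjust the pattern to what you actually see.

```python
import re

# Assumed header format per broker: "host:port (id: 0 rack: null) -> ("
BROKER_LINE = re.compile(r"^(?P<host>\S+):(?P<port>\d+) \(id: (?P<id>-?\d+)\b")


def advertised_listeners(cli_output):
    """Map broker id -> (host, port) from pasted kafka-broker-api-versions.sh output."""
    brokers = {}
    for line in cli_output.splitlines():
        match = BROKER_LINE.match(line.strip())
        if match:
            brokers[int(match["id"])] = (match["host"], int(match["port"]))
    return brokers


def not_in_bootstrap(brokers, bootstrap):
    """Return brokers whose advertised address is not one of the bootstrap addresses."""
    bootstrap = set(bootstrap)
    return {bid: addr for bid, addr in brokers.items() if addr not in bootstrap}
```

For example, with a hypothetical LB at ("10.1.1.1", 9094), `not_in_bootstrap(advertised_listeners(output), [("10.1.1.1", 9094)])` lists every broker address that MM2 must also be able to route to from the target cluster.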

Backing up on the LB topic, words of caution… Kafka and load balancers can be a really thorny combo. Many typical infrastructure software benefits you get from a load balancer don’t apply to Kafka given how the protocol works (and Kafka is natively providing those benefits). It’s not like a “don’t ever do this” combo, but more of a “be really sure about why you’re doing it and know enough of the Kafka internals and how they relate to LBs, as well as any potential gotchas like stickiness features in your LB, if you’re going to do it.” E.g., some people will put a LB in front of Kafka strictly for “failover without having to reconfigure clients” but even just that seemingly small thing is hard.

IMO you’re going to want to be able to do this in the problematic environment, or in a similar mirror environment set up for the purpose, in order to get to the bottom of this. It’s really hard to debug this kind of thing without being able to run tests like slimming down scale or cutting the LB out of the equation.
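For whenever such a test environment is available, here is a small Python sketch of how a slimmed-down variant of the source connector config could be derived from the one posted above: a single test topic, one task, and direct broker addresses in place of the LB. The function name, the "-debug" suffix, the topic name, and the broker addresses in the usage note are all hypothetical.

```python
import copy
import json


def slim_source_connector(connector, test_topic, direct_brokers):
    """Return a copy of a MirrorSourceConnector definition narrowed for debugging."""
    slim = copy.deepcopy(connector)  # leave the original definition untouched
    slim["name"] = connector["name"] + "-debug"
    cfg = slim["config"]
    cfg["topics"] = test_topic  # replicate only one low-volume topic
    cfg["tasks.max"] = "1"
    # Bypass the load balancer: list the brokers' advertised listeners directly.
    cfg["source.cluster.bootstrap.servers"] = ",".join(direct_brokers)
    return slim


def to_json(connector):
    """Serialize the connector definition for submission to the Connect REST API."""
    return json.dumps(connector, indent=2)
```

Usage would look like `to_json(slim_source_connector(original, "mm2-debug-topic", ["source-broker-0.example:9092", "source-broker-1.example:9092"]))`, where `original` is the parsed mirror-source-connector.json. If the slim connector replicates cleanly, adding the LB or the full topic set back one at a time should show which of the two triggers the timeouts.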