Hi Team,
We are running Apache Kafka MirrorMaker 2 (MM2) in active/active mode via connect-mirror-maker.sh on a 6-node Apache Kafka cluster (v3.8.1). The source cluster runs Kafka 2.8.0. Below is a summary of our setup and the issue we're facing.
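For reference, each node starts the dedicated MM2 driver that ships with Kafka, roughly like this (the install and config paths are placeholders from our environment):

/opt/kafka/bin/connect-mirror-maker.sh /opt/kafka/config/mm2.properties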
Setup
- MM2 is running on the destination Kafka cluster (3.8.1).
- We have bidirectional replication between clusters.
- We are using a standard MM2 config with the following error-handling setup:
### Configure MM2 to manage corrupt records - start
# retry for at most 10 minutes, waiting up to 30 seconds between consecutive failures
{{ source_cluster_name }}->{{ dest_cluster_name }}.errors.retry.timeout = 600000
{{ dest_cluster_name }}->{{ source_cluster_name }}.errors.retry.timeout = 600000
{{ source_cluster_name }}->{{ dest_cluster_name }}.errors.retry.delay.max.ms = 30000
{{ dest_cluster_name }}->{{ source_cluster_name }}.errors.retry.delay.max.ms = 30000
# log error context along with application logs, but do not include configs and messages
{{ source_cluster_name }}->{{ dest_cluster_name }}.errors.log.enable = true
{{ dest_cluster_name }}->{{ source_cluster_name }}.errors.log.enable = true
{{ source_cluster_name }}->{{ dest_cluster_name }}.errors.log.include.messages = false
{{ dest_cluster_name }}->{{ source_cluster_name }}.errors.log.include.messages = false
# produce error context into the Kafka topic
{{ source_cluster_name }}->{{ dest_cluster_name }}.errors.deadletterqueue.topic.name = dbeng-mm2-forward-deadletterqueue
{{ dest_cluster_name }}->{{ source_cluster_name }}.errors.deadletterqueue.topic.name = dbeng-mm2-backward-deadletterqueue
# Tolerate all errors.
{{ source_cluster_name }}->{{ dest_cluster_name }}.errors.tolerance = all
{{ dest_cluster_name }}->{{ source_cluster_name }}.errors.tolerance = all
### Configure MM2 to manage corrupt records - end
We also tried adding these at the top (worker) level, without a flow prefix:
errors.log.enable = true
errors.log.include.messages = true
Issue
Roughly every two weeks, when MM2 encounters a corrupt record on the source, the affected node stops replicating. Lag then grows on the affected topics and partitions, and the destination cluster can no longer fetch data from that source node. Despite the error-handling configs above, replication stays stuck until we manually restart the MM2 process on all 6 nodes (restarting only the affected node does not resume replication).
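To check whether MM2 is actually producing any error context, the DLQ topics can be inspected with the console consumer (the broker address below is a placeholder):

/opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server dest-broker:9092 --topic dbeng-mm2-forward-deadletterqueue --from-beginning --max-messages 10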
For reference, here is our full MM2 config:
## Mirror Maker Configurations
# name of the connector, e.g. "us-west->us-east"
name = {{ source_cluster_name }}-{{ dest_cluster_name }}
# Maximum number of tasks to use for this connector
tasks.max = 12
num.stream.threads = 6
# Setting replication factor of newly created remote topics
replication.factor = 3
errors.log.enable = true
errors.log.include.messages = true
# use ByteArrayConverter to ensure that records are not re-encoded and stay the same
key.converter = org.apache.kafka.connect.converters.ByteArrayConverter
value.converter = org.apache.kafka.connect.converters.ByteArrayConverter
# enable the internal REST server needed for multi-node dedicated mode (KIP-710)
dedicated.mode.enable.internal.rest = true
## Kafka clusters aliases
clusters = {{ source_cluster_name }}, {{ dest_cluster_name }}
# upstream cluster to replicate
{{ source_cluster_name }}.bootstrap.servers = {{ source_cluster_ips }}
# downstream cluster
{{ dest_cluster_name }}.bootstrap.servers = {{ dest_cluster_ips }}
# enable and configure individual replication flows
{{ source_cluster_name }}->{{ dest_cluster_name }}.enabled = true
{{ dest_cluster_name }}->{{ source_cluster_name }}.enabled = true
# monitor the source cluster for topic configuration changes
{{ source_cluster_name }}->{{ dest_cluster_name }}.sync.topic.configs.enabled = true
{{ dest_cluster_name }}->{{ source_cluster_name }}.sync.topic.configs.enabled = true
# do not monitor the source cluster for ACL changes
{{ source_cluster_name }}->{{ dest_cluster_name }}.sync.topic.acls.enabled = false
{{ dest_cluster_name }}->{{ source_cluster_name }}.sync.topic.acls.enabled = false
# regex of topics to replicate, e.g. "topic1|topic2|topic3". Comma-separated lists are also supported.
{{ source_cluster_name }}->{{ dest_cluster_name }}.topics = {{ src_to_gcp_topics_to_replicate }}
{{ dest_cluster_name }}->{{ source_cluster_name }}.topics = {{ gcp_to_src_topics_to_replicate }}
# configure where MM2 should start replicating data from
{{ source_cluster_name }}->{{ dest_cluster_name }}.consumer.auto.offset.reset = latest
{{ dest_cluster_name }}->{{ source_cluster_name }}.consumer.auto.offset.reset = latest
# include internal topics and emit heartbeats
{{ source_cluster_name }}->{{ dest_cluster_name }}.exclude.internal.topics = false
{{ source_cluster_name }}->{{ dest_cluster_name }}.emit.heartbeats.enabled = true
{{ dest_cluster_name }}->{{ source_cluster_name }}.exclude.internal.topics = false
{{ dest_cluster_name }}->{{ source_cluster_name }}.emit.heartbeats.enabled = true
# Enable automated consumer offset sync
{{ source_cluster_name }}->{{ dest_cluster_name }}.sync.group.offsets.enabled = true
{{ source_cluster_name }}->{{ dest_cluster_name }}.emit.checkpoints.enabled = true
{{ dest_cluster_name }}->{{ source_cluster_name }}.sync.group.offsets.enabled = true
{{ dest_cluster_name }}->{{ source_cluster_name }}.emit.checkpoints.enabled = true
offset.flush.timeout.ms = 60000
### Configure MM2 to manage corrupt records - start
# retry for at most 10 minutes, waiting up to 30 seconds between consecutive failures
{{ source_cluster_name }}->{{ dest_cluster_name }}.errors.retry.timeout = 600000
{{ dest_cluster_name }}->{{ source_cluster_name }}.errors.retry.timeout = 600000
{{ source_cluster_name }}->{{ dest_cluster_name }}.errors.retry.delay.max.ms = 30000
{{ dest_cluster_name }}->{{ source_cluster_name }}.errors.retry.delay.max.ms = 30000
# log error context along with application logs, but do not include configs and messages
{{ source_cluster_name }}->{{ dest_cluster_name }}.errors.log.enable = true
{{ dest_cluster_name }}->{{ source_cluster_name }}.errors.log.enable = true
{{ source_cluster_name }}->{{ dest_cluster_name }}.errors.log.include.messages = false
{{ dest_cluster_name }}->{{ source_cluster_name }}.errors.log.include.messages = false
# produce error context into the Kafka topic
{{ source_cluster_name }}->{{ dest_cluster_name }}.errors.deadletterqueue.topic.name = dbeng-mm2-forward-deadletterqueue
{{ dest_cluster_name }}->{{ source_cluster_name }}.errors.deadletterqueue.topic.name = dbeng-mm2-backward-deadletterqueue
# Tolerate all errors.
{{ source_cluster_name }}->{{ dest_cluster_name }}.errors.tolerance = all
{{ dest_cluster_name }}->{{ source_cluster_name }}.errors.tolerance = all
### Configure MM2 to manage corrupt records - end
# Forward: src -> tgt
# per-cluster client overrides, see https://kafka.apache.org/38/documentation/#georeplication ({source}.consumer.{consumer_config_name})
{{ source_cluster_name }}.consumer.max.poll.records = 20000
{{ source_cluster_name }}.consumer.receive.buffer.bytes = 33554432
{{ source_cluster_name }}.consumer.send.buffer.bytes = 33554432
{{ source_cluster_name }}.consumer.max.partition.fetch.bytes = 33554432
{{ source_cluster_name }}.producer.message.max.bytes = 37755000
{{ source_cluster_name }}.producer.compression.type = gzip
{{ source_cluster_name }}.producer.max.request.size = 26214400
{{ source_cluster_name }}.producer.buffer.memory = 524288000
{{ source_cluster_name }}.producer.batch.size = 524288
# Backward: tgt -> src
{{ dest_cluster_name }}.consumer.max.poll.records = 20000
{{ dest_cluster_name }}.consumer.receive.buffer.bytes = 33554432
{{ dest_cluster_name }}.consumer.send.buffer.bytes = 33554432
{{ dest_cluster_name }}.consumer.max.partition.fetch.bytes = 33554432
{{ dest_cluster_name }}.producer.message.max.bytes = 37755000
{{ dest_cluster_name }}.producer.compression.type = gzip
{{ dest_cluster_name }}.producer.max.request.size = 26214400
{{ dest_cluster_name }}.producer.buffer.memory = 524288000
{{ dest_cluster_name }}.producer.batch.size = 524288
# SASL Configurations
# only the destination cluster has authentication/authorization enabled
{{ dest_cluster_name }}.security.protocol=SASL_PLAINTEXT
{{ dest_cluster_name }}.sasl.mechanism=SCRAM-SHA-256
{{ dest_cluster_name }}.sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="{{ kafkaAdminUser }}" password="{{ kafka_admin_password }}";
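For completeness, the workaround we use today is restarting MM2 on all nodes; assuming a systemd unit named mm2 (the unit name is specific to our environment), that is:

sudo systemctl restart mm2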
Questions
- Is there a way to make MM2 skip corrupt records gracefully and continue replicating?
- Are we missing any critical config to enable proper dead-letter handling or error resilience?
- Has anyone faced a similar issue with corrupt records stalling replication in MM2 active/active setups?
This is causing a critical impact in our production setup, and we’d appreciate any guidance or best practices to handle such scenarios more gracefully.
Thanks in advance!