hostAlive: false for 2 KSQL servers out of 3, how to troubleshoot

Greetings all,

I have deployed KSQL using the official Helm chart (upgraded to version 7.0.0) with 3 replicas. I then created a stream that references a Kafka topic, and a table using "CREATE TABLE AS SELECT" that groups by the timestamp in the topic and counts the messages, i.e. it gives the rate of incoming messages per second over time.

Now, when I try to do a select on this table, I receive the following error:

Exception in thread "main" java.lang.IllegalStateException: KSQL error: {"@type":"statement_error","error_code":40001,"message":"Unable to execute pull query. [Partition 1 failed to find valid host. Hosts scanned: [ksql-server-84cf6b7477-swmwg:8088 was not selected because Host is not alive as of time 1637527823646], Partition 2 failed to find valid host. Hosts scanned: [ksql-server-84cf6b7477-df7f2:8088 was not selected because Host is not alive as of time 1637527823646], Partition 4 failed to find valid host. Hosts scanned: [ksql-server-84cf6b7477-swmwg:8088 was not selected because Host is not alive as of time 1637527823646], Partition 5 failed to find valid host. Hosts scanned: [ksql-server-84cf6b7477-df7f2:8088 was not selected because Host is not alive as of time 1637527823646], Partition 7 failed to find valid host. Hosts scanned: [ksql-server-84cf6b7477-swmwg:8088 was not selected because Host is not alive as of time 1637527823646], Partition 8 failed to find valid host. Hosts scanned: [ksql-server-84cf6b7477-df7f2:8088 was not selected because Host is not alive as of time 1637527823646], Partition 10 failed to find valid host. Hosts scanned: [ksql-server-84cf6b7477-swmwg:8088 was not selected because Host is not alive as of time 1637527823646], Partition 11 failed to find valid host. Hosts scanned: [ksql-server-84cf6b7477-df7f2:8088 was not selected because Host is not alive as of time 1637527823646], Partition 13 failed to find valid host. Hosts scanned: [ksql-server-84cf6b7477-swmwg:8088 was not selected because Host is not alive as of time 1637527823646], Partition 14 failed to find valid host. Hosts scanned: [ksql-server-84cf6b7477-df7f2:8088 was not selected because Host is not alive as of time 1637527823646], Partition 16 failed to find valid host. Hosts scanned: [ksql-server-84cf6b7477-swmwg:8088 was not selected because Host is not alive as of time 1637527823646], Partition 17 failed to find valid host. Hosts scanned: [ksql-server-84cf6b7477-df7f2:8088 was not selected because Host is not alive as of time 1637527823646], Partition 19 failed to find valid host. Hosts scanned: [ksql-server-84cf6b7477-swmwg:8088 was not selected because Host is not alive as of time 1637527823646], Partition 20 failed to find valid host. Hosts scanned: [ksql-server-84cf6b7477-df7f2:8088 was not selected because Host is not alive as of time 1637527823646], Partition 22 failed to find valid host. Hosts scanned: [ksql-server-84cf6b7477-swmwg:8088 was not selected because Host is not alive as of time 1637527823646], Partition 23 failed to find valid host. Hosts scanned: [ksql-server-84cf6b7477-df7f2:8088 was not selected because Host is not alive as of time 1637527823646]]","statementText":"SELECT ts, count FROM workload_input WHERE ts>=1637527864000;","entities":[]}

When I check the cluster status, I see that 2 of the servers are showing "hostAlive" as false. However, I cannot find the reason for this in the logs, nor any further information online. How can I troubleshoot this? Are there any common reasons for this?
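
The long pull query error above is easier to read once the failing partitions are grouped by host. The following is a minimal Python sketch, not part of KSQL itself: the function name `partitions_by_failed_host` and the regular expression are assumptions based only on the message format shown in this post, and the pattern captures just the first host listed for each partition.

```python
import re
from collections import defaultdict

# Assumed message shape (taken from the error in this post):
#   "Partition <n> failed to find valid host. Hosts scanned: [<host> was not selected because ..."
PATTERN = re.compile(
    r"Partition (\d+) failed to find valid host\.\s+"
    r"Hosts scanned: \[(\S+) was not selected because"
)


def partitions_by_failed_host(message):
    """Group partition numbers by the first host that was scanned and rejected."""
    grouped = defaultdict(list)
    for partition, host in PATTERN.findall(message):
        grouped[host].append(int(partition))
    return dict(grouped)


# Shortened sample built from the first two entries of the error message.
sample_message = (
    "Unable to execute pull query. [Partition 1 failed to find valid host. "
    "Hosts scanned: [ksql-server-84cf6b7477-swmwg:8088 was not selected because "
    "Host is not alive as of time 1637527823646], Partition 2 failed to find valid host. "
    "Hosts scanned: [ksql-server-84cf6b7477-df7f2:8088 was not selected because "
    "Host is not alive as of time 1637527823646]]"
)

if __name__ == "__main__":
    for host, partitions in partitions_by_failed_host(sample_message).items():
        print(host, partitions)
```

Applied to the full message, this should make it visible that every failing partition maps to one of two hosts, while partitions owned by the third host do not appear in the error at all.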

Hi @morkargel

Could you provide some details about your setup?
Are all pods up and running?

Are you trying to connect from within the K8s cluster?

Best,
Michael

Hi Michael,

Thanks for the reply.

Yes, all pods are up and running. I haven’t changed anything from the helm chart besides updating the image version (cp-helm-charts/README.md at 3ffbdf93dad1baf8a3c9a58a92b8e44bb848cd1c · confluentinc/cp-helm-charts · GitHub). It works with a single pod, but to improve performance of the query I would like to parallelize the query across more workers.

I've created a stream with the timestamp as the key:
CREATE OR REPLACE STREAM events
(id VARCHAR, sd DOUBLE, dt DOUBLE, ts BIGINT KEY)
WITH (KAFKA_TOPIC='events', VALUE_FORMAT='JSON', TIMESTAMP='ts');

I've created a streaming aggregation query as a table:
CREATE OR REPLACE TABLE workload AS
SELECT ts, COUNT(*) AS count
FROM events
GROUP BY ts;

Looking at the cluster info we get:

{
  "clusterStatus": {
    "ksql-server-84cf6b7477-5jm8d:8088": {
      "hostAlive": false,
      "lastStatusUpdateMs": 1637588881075,
      "activeStandbyPerQuery": {
        "CTAS_WORKLOAD_INPUT_1": {
          "activeStores": [
            "Aggregate-Aggregate-Materialize"
          ],
          "activePartitions": [
            { "topic": "input", "partition": 8 },
            { "topic": "input", "partition": 23 },
            { "topic": "input", "partition": 5 },
            { "topic": "input", "partition": 20 },
            { "topic": "input", "partition": 2 },
            { "topic": "input", "partition": 17 },
            { "topic": "input", "partition": 14 },
            { "topic": "input", "partition": 11 }
          ],
          "standByStores": [],
          "standByPartitions": []
        }
      },
      "hostStoreLags": {
        "stateStoreLags": {},
        "updateTimeMs": 0
      }
    },
    "ksql-server-84cf6b7477-rpvgj:8088": {
      "hostAlive": false,
      "lastStatusUpdateMs": 1637588881075,
      "activeStandbyPerQuery": {
        "CTAS_WORKLOAD_INPUT_1": {
          "activeStores": [
            "Aggregate-Aggregate-Materialize"
          ],
          "activePartitions": [
            { "topic": "input", "partition": 7 },
            { "topic": "input", "partition": 22 },
            { "topic": "input", "partition": 4 },
            { "topic": "input", "partition": 19 },
            { "topic": "input", "partition": 1 },
            { "topic": "input", "partition": 16 },
            { "topic": "input", "partition": 13 },
            { "topic": "input", "partition": 10 }
          ],
          "standByStores": [],
          "standByPartitions": []
        }
      },
      "hostStoreLags": {
        "stateStoreLags": {},
        "updateTimeMs": 0
      }
    },
    "ksql-server-84cf6b7477-dv5rj:8088": {
      "hostAlive": true,
      "lastStatusUpdateMs": 1637588787018,
      "activeStandbyPerQuery": {
        "CTAS_WORKLOAD_INPUT_1": {
          "activeStores": [
            "Aggregate-Aggregate-Materialize"
          ],
          "activePartitions": [
            { "topic": "input", "partition": 6 },
            { "topic": "input", "partition": 21 },
            { "topic": "input", "partition": 3 },
            { "topic": "input", "partition": 18 },
            { "topic": "input", "partition": 0 },
            { "topic": "input", "partition": 15 },
            { "topic": "input", "partition": 12 },
            { "topic": "input", "partition": 9 }
          ],
          "standByStores": [],
          "standByPartitions": []
        }
      },
      "hostStoreLags": {
        "stateStoreLags": {},
        "updateTimeMs": 0
      }
    }
  }
}

Regards,
Morgan.
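
For anyone following along, the cluster status output above can be condensed into a per-host summary of liveness and owned partitions. This is only a sketch under assumptions: it expects the response to have been parsed into a Python dict with the same field names as the output in this post ("clusterStatus", "hostAlive", "activeStandbyPerQuery", "activePartitions"), and the function names `summarize_cluster_status` and `dead_hosts` are made up for illustration.

```python
def summarize_cluster_status(status):
    """Map each host to (hostAlive, sorted list of (topic, partition) it is active for)."""
    summary = {}
    for host, info in status["clusterStatus"].items():
        partitions = set()
        # Collect active partitions across all queries reported for this host.
        for query in info.get("activeStandbyPerQuery", {}).values():
            for tp in query.get("activePartitions", []):
                partitions.add((tp["topic"], tp["partition"]))
        summary[host] = (info["hostAlive"], sorted(partitions))
    return summary


def dead_hosts(status):
    """Return the sorted names of hosts whose hostAlive flag is false."""
    summary = summarize_cluster_status(status)
    return sorted(host for host, (alive, _) in summary.items() if not alive)


# Trimmed sample using two hosts and a few partitions from the output above.
sample_status = {
    "clusterStatus": {
        "ksql-server-84cf6b7477-5jm8d:8088": {
            "hostAlive": False,
            "lastStatusUpdateMs": 1637588881075,
            "activeStandbyPerQuery": {
                "CTAS_WORKLOAD_INPUT_1": {
                    "activePartitions": [
                        {"topic": "input", "partition": 8},
                        {"topic": "input", "partition": 23},
                    ]
                }
            },
        },
        "ksql-server-84cf6b7477-dv5rj:8088": {
            "hostAlive": True,
            "lastStatusUpdateMs": 1637588787018,
            "activeStandbyPerQuery": {
                "CTAS_WORKLOAD_INPUT_1": {
                    "activePartitions": [
                        {"topic": "input", "partition": 6},
                        {"topic": "input", "partition": 0},
                    ]
                }
            },
        },
    }
}

if __name__ == "__main__":
    for host, (alive, partitions) in summarize_cluster_status(sample_status).items():
        print(host, "alive" if alive else "NOT alive", partitions)
    print("dead hosts:", dead_hosts(sample_status))
```

Comparing the partitions owned by the hosts marked as not alive with the partitions named in the pull query error is one way to confirm that the two symptoms describe the same problem.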

Hi Morgan,

Thanks for providing the information.
Did you check the logs? If so, are there any errors?

What does helm status <your_release_name> say?

Best,
Michael