We are getting a lot of "Communications link failure" errors from our application against a MariaDB Galera cluster, at random times, after migrating from CentOS 7 to Rocky Linux 9.
MariaDB version: 10.5.18
Galera version: 26.4.13
Errors:
Caused by: com.mysql.cj.exceptions.CJCommunicationsException: Communications link failure
…
Caused by: java.sql.SQLNonTransientConnectionException: Communications link failure during rollback(). Transaction resolution unknown.
at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:110)
at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:97)
at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:89)
at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:63)
at com.mysql.cj.jdbc.ConnectionImpl.rollback(ConnectionImpl.java:1848)
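To correlate these failures with a particular node, time of day, or connection idle time, it may help to log them separately from other SQL errors. The following is only a minimal sketch using the standard `java.sql` types that appear in the trace above. The class name `LinkFailureCheck` and the method `isConnectionFailure` are hypothetical, and treating SQLSTATE class `08` (the standard "connection exception" class) as the signal is an assumption about how the application would want to classify errors.

```java
import java.sql.SQLException;
import java.sql.SQLNonTransientConnectionException;

// Hypothetical helper: decides whether an exception chain contains a
// connection-level failure, so it can be logged or counted separately.
public class LinkFailureCheck {

    // Walks the cause chain (bounded, in case of a cyclic chain) and returns
    // true if any link is a connection exception by type or by SQLSTATE class 08.
    static boolean isConnectionFailure(Throwable t) {
        int depth = 0;
        for (Throwable cur = t; cur != null && depth < 32; cur = cur.getCause(), depth++) {
            if (cur instanceof SQLNonTransientConnectionException) {
                return true;
            }
            if (cur instanceof SQLException) {
                String state = ((SQLException) cur).getSQLState();
                if (state != null && state.startsWith("08")) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Throwable wrapped = new RuntimeException(
                new SQLNonTransientConnectionException("Communications link failure", "08S01"));
        System.out.println("connection failure: " + isConnectionFailure(wrapped));
    }
}
```

Logging the database host and a timestamp wherever this returns true would show whether the failures cluster on one node or follow periods of inactivity.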
This used to work fine on CentOS 7 with no errors. Nothing on the application or database side has changed except the OS, i.e. CentOS 7 to Rocky Linux 9.
Migrations/upgrades from CentOS 7 are not supported. We suggest a clean install of Rocky 9, then configuring MariaDB Galera and restoring your databases.
Alternatively, check your MariaDB configuration and ask on the MariaDB forums, since this is a MariaDB problem and not a Rocky Linux one. But since Rocky doesn’t support upgrades, our advice is as mentioned - clean install and restore your databases.
Yes, we did a clean install of Rocky Linux and restored the databases with mariabackup. The MariaDB version, configuration, and server settings are all the same on both operating systems. But we only see this issue on Rocky Linux; when we reverted to CentOS, the random communication link failures stopped.
So we are trying to find out whether anyone else has been facing this issue.
If SELinux isn’t blocking, and firewalld isn’t blocking port communication, then it would suggest the problem is in MariaDB, somewhere in its configuration.
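One way to check the port-blocking possibility from the Java side is a plain TCP connect test. The following is a minimal sketch, not a definitive diagnostic. The class name `PortProbe` is hypothetical, hosts are taken from the command line, and the port list assumes the Galera defaults (3306 for clients, 4567 for replication, 4568 for IST, 4444 for SST); adjust it if your cluster uses other ports. The application host normally only needs 3306, while the other ports matter between cluster nodes.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper: tries a TCP connect to each host/port pair and
// reports whether the connection could be established within a timeout.
public class PortProbe {

    // Returns true if a TCP connection to host:port succeeds within timeoutMs.
    static boolean isReachable(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Refused, timed out, unresolved host, or filtered by a firewall.
            return false;
        }
    }

    public static void main(String[] args) {
        // Hosts come from the command line; localhost is only a fallback.
        String[] hosts = args.length > 0 ? args : new String[] {"127.0.0.1"};
        // Assumed Galera default ports; change these to match your setup.
        int[] ports = {3306, 4567, 4568, 4444};
        for (String host : hosts) {
            for (int port : ports) {
                String result = isReachable(host, port, 2000) ? "open" : "unreachable";
                System.out.println(host + ":" + port + " " + result);
            }
        }
    }
}
```

A successful connect only shows the port is reachable at that moment. It does not rule out SELinux denials on the server side or connections being dropped later, so for intermittent failures the test would need to be repeated over time.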
Probably not much help, as we don’t use Java, but we do have multiple MariaDB 10.11 and Galera 26.4.16 clusters, and none of them are seeing any sort of “communication link failure” issues with our various Splunk (actually Splunk uses Java come to think of it), Python, FreeRADIUS, and other apps. We do however use ProxySQL 2.2.2 in front of all of our clusters, so maybe that’s the main difference? Are you using any sort of HA in front of the connectors?